The present disclosure relates to methods, techniques, and systems for ability enhancement and, more particularly, to methods, techniques, and systems for ability enhancement in a transportation-related context by sharing threat information between devices and/or vehicles present on a roadway or in other assistance related contexts such as to provide speaker related information, language translation, or enhanced voice conferencing.
Human abilities such as hearing, vision, memory, foreign or native language comprehension, and the like may be limited for various reasons. For example, as people age, various abilities such as hearing, vision, or memory, may decline or otherwise become compromised. In some countries, as the population in general ages, such declines may become more common and widespread. In addition, young people are increasingly listening to music through headphones, which may also result in hearing loss at earlier ages.
In addition, limits on human abilities may be exposed by factors other than aging, injury, or overuse. As one example, the world population is faced with an ever increasing amount of information to review, remember, and/or integrate. Managing increasing amounts of information becomes increasingly difficult in the face of limited or declining abilities such as hearing, vision, and memory.
These problems may be further exacerbated and even result in serious health risks in a transportation-related context, as distracted and/or ability impaired drivers are more prone to be involved in accidents. For example, many drivers are increasingly distracted from the task of driving by an onslaught of information from cellular phones, smart phones, media players, navigation systems, and the like. In addition, an aging population in some regions may yield an increasing number or share of drivers who are vision and/or hearing impaired.
As another example, as the world becomes increasingly connected, both virtually and physically (e.g., due to improved communication and cheaper travel), people are more frequently encountering others who speak different languages. In addition, the communication technologies that support an interconnected, global economy may further expose limited human abilities. For example, it may be difficult for a user to determine who is speaking during a conference call. Even if the user is able to identify the speaker, it may still be difficult for the user to recall or access related information about the speaker and/or topics discussed during the call. Also, it may be difficult for a user to recall all of the events or information discussed during the course of a conference call or other type of conversation.
Current approaches to addressing limits on human abilities may suffer from various drawbacks. For example, there may be a social stigma connected with wearing hearing aids, corrective lenses, or similar devices. In addition, hearing aids typically perform only limited functions, such as amplifying or modulating sounds for a hearer. Furthermore, legal regimes that attempt to prohibit the use of telephones or media devices while driving may not be effective due to enforcement difficulties, declining law enforcement budgets, and the like. Nor do such regimes address a great number of other sources of distraction or impairment, such as other passengers, car radios, blinding sunlight, darkness, or the like.
As another example, current approaches to foreign language translation, such as phrase books or time-intensive language acquisition, are typically inefficient and/or unwieldy. Furthermore, existing communication technologies are not well integrated with one another, making it difficult to access information via a first device that is relevant to a conversation occurring via a second device. Also, manual note taking during the course of a conference call or other conversation may be intrusive, distracting, and/or ineffective. For example, a note-taker may not be able to accurately capture everything that was said and/or meeting notes may not be well integrated with other information sources or items that are related to the subject matter of the conference call.
FIGS. 3.1-3.78 are example flow diagrams of audible assistance processes performed by example embodiments.
FIGS. 7.1-7.81 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 11.1-11.80 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 15.1-15.108 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 19.1-19.70 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 23.1-23.94 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 27.1-27.112 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 31.1-31.132 are example flow diagrams of ability enhancement processes performed by example embodiments.
FIGS. 35.1-35.93 are example flow diagrams of ability enhancement processes performed by example embodiments.
Embodiments described herein provide enhanced computer- and network-based methods and systems for sensory augmentation and, more particularly, providing audible assistance to a user via a hearing device. Example embodiments provide an Audible Assistance Facilitator System (“AAFS”). The AAFS may augment, enhance, or improve the senses (e.g., hearing) and other faculties (e.g., memory) of a user, such as by assisting a user with the recall of names, events, communications, documents, or other information related to a speaker with whom the user is conversing. For example, when the user engages a speaker in conversation, the AAFS may “listen” to the speaker in order to identify the speaker and/or determine other speaker-related information, such as events or communications relating to the speaker and/or the user. Then, the AAFS may inform the user of the determined information, such as by “speaking” the information into an earpiece or other audio output device. The user can hear the information provided by the AAFS and advantageously use that information to avoid embarrassment (e.g., due to an inability to recall the speaker's name), engage in a more productive conversation (e.g., by quickly accessing information about events, deadlines, or communications related to the speaker), or the like.
In some embodiments, the AAFS is configured to receive data that represents an utterance of a speaker and that is obtained at or about a hearing device associated with a user. The AAFS may then identify the speaker based at least in part on the received data, such as by performing speaker recognition and/or speech recognition with the received data. The AAFS may then determine speaker-related information associated with the identified speaker, such as an identifier (e.g., name or title) of the speaker, an information item (e.g., a document, event, communication) that references the speaker, or the like. Then, the AAFS may inform the user of the determined speaker-related information by, for example, outputting audio (e.g., via text-to-speech processing) of the speaker-related information via the hearing device.
In the scenario illustrated in
The hearing device 120 receives a speech signal that represents the utterance 110, such as by receiving a digital representation of an audio signal received by a microphone of the hearing device 120. The hearing device 120 then transmits data representing the speech signal to the AAFS 100. Transmitting the data representing the speech signal may include transmitting audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal.
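By way of non-limiting illustration, the following Python sketch shows one way audio captured at the hearing device 120 might be converted into mel frequency cepstral coefficients before transmission. The librosa library and the function name are assumptions made for this illustration and are not required by the embodiments described herein.

    # Sketch: converting captured audio into MFCC speech vectors prior to
    # transmission, as one possible representation of the speech signal.
    import librosa

    def speech_signal_to_mfcc(wav_path, n_mfcc=13):
        """Load an audio file and return a (frames x n_mfcc) matrix of MFCCs."""
        audio, sample_rate = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
        mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
        return mfcc.T  # one row per analysis frame

    # The hearing device could transmit these frames instead of raw audio:
    # frames = speech_signal_to_mfcc("utterance.wav")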
The AAFS 100 then identifies the speaker based on the received data representing the speech signal. In some embodiments, identifying the speaker may include performing speaker recognition, such as by generating a “voice print” from the received data and comparing the generated voice print to previously obtained voice prints. For example, the generated voice print may be compared to multiple voice prints that are stored as audio data 130c and that each correspond to a speaker, in order to determine a speaker who has a voice that most closely matches the voice of the speaker 102. The voice prints stored as audio data 130c may be generated based on various sources of data, including data corresponding to speakers previously identified by the AAFS 100, voice mail messages, speaker enrollment data, or the like.
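By way of non-limiting illustration, the following Python sketch compares a voice print generated from the received data against previously stored voice prints using cosine similarity over averaged MFCC frames. Production systems typically use richer speaker models (e.g., Gaussian mixture models or learned embeddings); the data structures and names shown are illustrative assumptions only.

    # Sketch: naive speaker recognition by cosine similarity between "voice prints"
    # (here, mean MFCC vectors). This only illustrates the compare-against-stored-
    # prints step; real systems would use richer models.
    import numpy as np

    def make_voice_print(mfcc_frames):
        """Collapse a (frames x coeffs) MFCC matrix into a single voice-print vector."""
        return np.mean(mfcc_frames, axis=0)

    def identify_speaker(utterance_print, stored_prints):
        """Return the stored speaker whose voice print best matches the utterance."""
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        scores = {name: cosine(utterance_print, vp) for name, vp in stored_prints.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]

    # stored_prints might be built from voice mail messages or enrollment data:
    # stored_prints = {"Bill": make_voice_print(bill_frames), "Bob": make_voice_print(bob_frames)}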
In some embodiments, identifying the speaker may include performing speech recognition, such as by automatically converting the received data representing the speech signal into text. The text of the speaker's utterance may then be used to identify the speaker. In particular, the text may identify one or more entities such as information items (e.g., communications, documents), events (e.g., meetings, deadlines), persons, or the like, that may be used by the AAFS 100 to identify the speaker. The information items may be accessed with reference to the messages 130a and/or documents 130b. As one example, the speaker's utterance 110 may identify an email message that was sent only to the speaker 102 and the user 104 (e.g., “That sure was a nasty email Bob sent us”). As another example, the speaker's utterance 110 may identify a meeting or other event to which both the speaker 102 and the user 104 are invited.
Note that in some cases, the text of the speaker's utterance 110 may not definitively identify the speaker 102, such as when a communication was sent to one or more recipients in addition to the speaker 102 and the user 104. However, in such cases the text may still be used by the AAFS 100 to narrow the set of potential speakers, and may be combined with (or used to improve) other techniques for speaker identification, including speaker recognition as discussed above.
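By way of non-limiting illustration, the following Python sketch narrows a candidate set by matching the recognized text of an utterance against stored messages and collecting the senders and recipients of the matching messages. The message record structure and the overlap threshold are illustrative assumptions.

    # Sketch: narrowing candidate speakers using the recognized text of an utterance.
    # If the transcript mentions a message that only certain people received, the
    # sender and recipients of that message constrain who the speaker could be.

    def narrow_candidates_by_text(transcript, messages, user):
        """Return the set of people (other than the user) associated with any
        stored message whose subject or body overlaps with the transcript."""
        words = set(transcript.lower().split())
        candidates = set()
        for msg in messages:
            msg_words = set((msg["subject"] + " " + msg["body"]).lower().split())
            if len(words & msg_words) >= 3:  # crude overlap threshold
                candidates.update(msg["recipients"] + [msg["sender"]])
        candidates.discard(user)
        return candidates

    # messages = [{"sender": "bob@example.com",
    #              "recipients": ["user@example.com", "bill@example.com"],
    #              "subject": "nasty email", "body": "..."}]
    # narrow_candidates_by_text("that sure was a nasty email bob sent us",
    #                           messages, "user@example.com")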
The AAFS 100 then determines speaker-related information associated with the speaker 102. The speaker-related information may be a name or other identifier of the speaker. The speaker-related information may also or instead be other information about or related to the speaker, such as an organization of the speaker, an information item that references the speaker, an event involving the speaker, or the like. The speaker-related information may be determined with reference to the messages 130a, documents 130b, and/or audio data 130c. For example, having determined the identity of the speaker 102, the AAFS 100 may search for emails and/or documents that are stored as messages 130a and/or documents 130b and that reference (e.g., are sent to, are authored by, are named in) the speaker 102. Other types of speaker-related information are contemplated, including social networking information, such as personal or professional relationship graphs represented by a social networking service, messages or status updates sent within a social network, or the like. Social networking information may also be derived from other sources, including email lists, contact lists, communication patterns (e.g., frequent recipients of emails), or the like.
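By way of non-limiting illustration, the following Python sketch gathers speaker-related information by scanning stored messages and documents for references to an identified speaker. The record structures are illustrative assumptions.

    # Sketch: once the speaker is identified, gather information items that
    # reference that speaker (sent to/by the speaker, or naming the speaker).

    def speaker_related_items(speaker_name, messages, documents):
        """Collect messages and documents that reference the speaker."""
        related = []
        for msg in messages:
            if speaker_name in (msg["sender"], *msg["recipients"]) or speaker_name in msg["body"]:
                related.append(("message", msg["subject"]))
        for doc in documents:
            if speaker_name in doc.get("authors", []) or speaker_name in doc["text"]:
                related.append(("document", doc["title"]))
        return related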
The AAFS 100 then informs the user 104 of the determined speaker-related information via the hearing device 120. Informing the user may include "speaking" the information, such as by converting textual information into audio via text-to-speech processing (e.g., speech synthesis), and then presenting the audio via a speaker (e.g., earphone, earpiece, earbud) of the hearing device 120. In the illustrated scenario, the AAFS 100 causes the hearing device 120 to make an utterance 112 by playing audio of the words "That's Bill" via a speaker (not shown) of the hearing device 120. Once the user 104 hears the utterance 112 from the hearing device 120, the user 104 responds to the speaker's original utterance 110 with a response utterance 114 by speaking the words "Hi Bill!" As the speaker 102 and the user 104 continue to speak, the AAFS 100 may monitor the conversation and continue to determine and present speaker-related information to the user 104.
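By way of non-limiting illustration, the following Python sketch uses the pyttsx3 text-to-speech package to "speak" determined information to the user. Any speech synthesizer could be used; the library choice is an assumption made for the illustration.

    # Sketch: informing the user by converting speaker-related information into
    # audio via text-to-speech and playing it through the default audio output.
    import pyttsx3

    def speak_to_user(text):
        """Synthesize `text` and play it through the default audio output."""
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()

    # speak_to_user("That's Bill")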
Each of the illustrated hearing devices 120 includes or may be communicatively coupled to a microphone operable to receive a speech signal from a speaker. As described above, the hearing device 120 may then convert the speech signal into data representing the speech signal, and then forward the data to the AAFS 100.
Each of the illustrated hearing devices 120 includes or may be communicatively coupled to a speaker operable to generate and output audio signals that may be perceived by the user 104. As described above, the AAFS 100 may present information to the user 104 via the hearing device 120, for example by converting a textual representation of a name or other speaker-related information into an audio representation, and then causing that audio representation to be output via a speaker of the hearing device 120.
Note that although the AAFS 100 is shown as being separate from a hearing device 120, some or all of the functions of the AAFS 100 may be performed within or by the hearing device 120 itself. For example, the smart phone hearing device 120a and/or the media device hearing device 120c may have sufficient processing power to perform all or some functions of the AAFS 100, including speaker identification (e.g., speaker recognition, speech recognition), determining speaker-related information, presenting the determined information (e.g., by way of text-to-speech processing), or the like. In some embodiments, the hearing device 120 includes logic to determine where to perform various processing tasks, so as to advantageously distribute processing between available resources, including that of the hearing device 120, other nearby devices (e.g., a laptop or other computing device of the user 104 and/or the speaker 102), remote devices (e.g., “cloud-based” processing and/or storage), and the like.
Other types of hearing devices are contemplated. For example, a land-line telephone may be configured to operate as a hearing device, so that the AAFS 100 can identify speakers who are engaged in a conference call. As another example, a hearing device may be or be part of a desktop computer, laptop computer, PDA, tablet computer, or the like.
The speech and language engine 210 includes a speech recognizer 212, a speaker recognizer 214, and a natural language processor 216. The speech recognizer 212 transforms speech audio data received from the hearing device 120 into textual representation of an utterance represented by the speech audio data. In some embodiments, the performance of the speech recognizer 212 may be improved or augmented by use of a language model (e.g., representing likelihoods of transitions between words, such as based on n-grams) or speech model (e.g., representing acoustic properties of a speaker's voice) that is tailored to or based on an identified speaker. For example, once a speaker has been identified, the speech recognizer 212 may use a language model that was previously generated based on a corpus of communications and other information items authored by the identified speaker. A speaker-specific language model may be generated based on a corpus of documents and/or messages authored by a speaker. Speaker-specific speech models may be used to account for accents or channel properties (e.g., due to environmental factors or communication equipment) that are specific to a particular speaker, and may be generated based on a corpus of recorded speech from the speaker.
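By way of non-limiting illustration, the following Python sketch builds a simple speaker-specific bigram language model from a corpus of text authored by a speaker. Practical recognizers use substantially richer models; the sketch only illustrates the idea of speaker-tailored word-transition likelihoods, and the names are illustrative.

    # Sketch: a speaker-specific bigram language model estimated from documents
    # and messages authored by the speaker.
    from collections import defaultdict

    def build_bigram_model(documents):
        """Return P(next_word | word) estimated from the speaker's documents."""
        counts = defaultdict(lambda: defaultdict(int))
        for text in documents:
            words = text.lower().split()
            for w1, w2 in zip(words, words[1:]):
                counts[w1][w2] += 1
        model = {}
        for w1, nexts in counts.items():
            total = sum(nexts.values())
            model[w1] = {w2: c / total for w2, c in nexts.items()}
        return model

    def transition_probability(model, w1, w2):
        """Likelihood of w2 following w1 under the speaker's language model."""
        return model.get(w1, {}).get(w2, 0.0)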
The speaker recognizer 214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the hearing device 120. The speaker recognizer 214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data.
The natural language processor 216 processes text generated by the speech recognizer 212 and/or located in information items obtained from the speaker-related information sources 130. In doing so, the natural language processor 216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification and/or other functions of the AAFS 100. For example, the natural language processor 216 may process status updates posted by the user 104 on a social networking service, to determine that the user 104 recently attended a conference in a particular city, and this fact may be used to identify a speaker and/or determine other speaker-related information.
The agent logic 220 implements the core intelligence of the AAFS 100. The agent logic 220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to identify speakers and/or determine speaker-related information. For example, the agent logic 220 may combine spoken text from the speech recognizer 212, a set of potentially matching speakers from the speaker recognizer 214, and information items from the information sources 130, in order to determine the most likely identity of the current speaker.
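By way of non-limiting illustration, the following Python sketch combines acoustic likelihoods from a speaker recognizer with a set of candidates suggested by recognized text or information items, in the general spirit of the reasoning described above. The weighting scheme is an illustrative assumption, not a required inference procedure.

    # Sketch: combining evidence sources. Acoustic scores are boosted for
    # candidates also implicated by the recognized text or information items.

    def combine_evidence(acoustic_scores, text_candidates, boost=2.0):
        """acoustic_scores: {name: likelihood}; text_candidates: set of names
        suggested by spoken text or information items. Returns the best candidate
        and normalized posteriors over all candidates."""
        combined = {}
        for name, score in acoustic_scores.items():
            combined[name] = score * (boost if name in text_candidates else 1.0)
        total = sum(combined.values()) or 1.0
        posteriors = {name: s / total for name, s in combined.items()}
        best = max(posteriors, key=posteriors.get)
        return best, posteriors

    # best, posteriors = combine_evidence({"Bill": 0.6, "Bob": 0.4}, {"Bill"})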
The presentation engine 230 includes a text-to-speech processor 232. The agent logic 220 may use or invoke the text-to-speech processor 232 in order to convert textual speaker-related information into audio output suitable for presentation via the hearing device 120.
Note that although speaker identification is herein sometimes described as including the positive identification of a single speaker, it may instead or also include determining likelihoods that each of one or more persons is the current speaker. For example, the speaker recognizer 214 may provide to the agent logic 220 indications of multiple candidate speakers, each having a corresponding likelihood. The agent logic 220 may then select the most likely candidate based on the likelihoods alone or in combination with other information, such as that provided by the speech recognizer 212, natural language processor 216, speaker-related information sources 130, or the like. In some cases, such as when there are a small number of reasonably likely candidate speakers, the agent logic 220 may inform the user 104 of the identities of all of the candidate speakers (as opposed to a single candidate speaker), as such information may be sufficient to trigger the user's recall.
FIGS. 3.1-3.78 are example flow diagrams of audible assistance processes performed by example embodiments.
At block 3.101, the process performs receiving data representing a speech signal obtained at a hearing device associated with a user, the speech signal representing an utterance of a speaker.
At block 3.102, the process performs identifying the speaker based on the data representing the speech signal.
At block 3.103, the process performs determining speaker-related information associated with the identified speaker.
At block 3.104, the process performs informing the user of the speaker-related information via the hearing device.
At block 3.201, the process performs informing the user of an identifier of the speaker. In some embodiments, the identifier of the speaker may be or include a given name, surname (e.g., last name, family name), nickname, title, job description, or other type of identifier of or associated with the speaker.
At block 3.301, the process performs informing the user of information aside from identifying information related to the speaker. In some embodiments, information aside from identifying information may include information that is not a name or other identifier (e.g., job title) associated with the speaker. For example, the process may tell the user about an event or communication associated with or related to the speaker.
At block 3.401, the process performs informing the user of an organization to which the speaker belongs. In some embodiments, informing the user of an organization may include notifying the user of a business, group, school, club, team, company, or other formal or informal organization with which the speaker is affiliated.
At block 3.501, the process performs informing the user of a company associated with the speaker. Companies may include profit or non-profit entities, regardless of organizational structure (e.g., corporations, partnerships, sole proprietorships).
At block 3.601, the process performs informing the user of a previously transmitted communication referencing the speaker. Various forms of communication are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, a communication can include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 3.701, the process performs informing the user of an email transmitted between the speaker and the user. An email transmitted between the speaker and the user may include an email sent from the speaker to the user, or vice versa.
At block 3.801, the process performs informing the user of a text message transmitted between the speaker and the user. Text messages may include short messages according to various protocols, including SMS, MMS, and the like.
At block 3.901, the process performs informing the user of an event involving the user and the speaker. An event may be any occurrence that involves or involved the user and the speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the speaker, an upcoming deadline (e.g., for a project), or the like.
At block 3.1001, the process performs informing the user of a previously occurring event.
At block 3.1101, the process performs informing the user of a future event.
At block 3.1201, the process performs informing the user of a project.
At block 3.1301, the process performs informing the user of a meeting.
At block 3.1401, the process performs informing the user of a deadline.
At block 3.1501, the process performs accessing information items associated with the speaker. In some embodiments, accessing information items associated with the speaker may include retrieving files, documents, data records, or the like from various sources, such as local or remote storage devices, including cloud-based servers, and the like. In some embodiments, accessing information items may also or instead include scanning, searching, indexing, or otherwise processing information items to find ones that include, name, mention, or otherwise reference the speaker.
At block 3.1601, the process performs searching for information items that reference the speaker. In some embodiments, searching may include formulating a search query to provide to a document management system or any other data/document store that provides a search interface.
At block 3.1701, the process performs searching stored emails to find emails that reference the speaker. In some embodiments, emails that reference the speaker may include emails sent from the speaker, emails sent to the speaker, emails that name or otherwise identify the speaker in the body of an email, or the like.
At block 3.1801, the process performs searching stored text messages to find text messages that reference the speaker. In some embodiments, text messages that reference the speaker include messages sent to/from the speaker, messages that name or otherwise identify the speaker in a message body, or the like.
At block 3.1901, the process performs accessing a social networking service to find messages or status updates that reference the speaker. In some embodiments, accessing a social networking service may include searching for postings, status updates, personal messages, or the like that have been posted by, posted to, or otherwise reference the speaker. Example social networking services include Facebook, Twitter, Google Plus, and the like. Access to a social networking service may be obtained via an API or similar interface that provides access to social networking data related to the user and/or the speaker.
At block 3.2001, the process performs accessing a calendar to find information about appointments with the speaker. In some embodiments, accessing a calendar may include searching a private or shared calendar to locate a meeting or other appointment with the speaker, and providing such information to the user via the hearing device.
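By way of non-limiting illustration, the following Python sketch filters a list of calendar events for upcoming appointments that include the speaker as an attendee. The event records are illustrative assumptions; a practical implementation would query a calendar service's interface instead of an in-memory list.

    # Sketch: locating upcoming appointments that involve the identified speaker.
    from datetime import datetime

    def appointments_with_speaker(events, speaker_name, now=None):
        """Return (start_time, title) pairs for upcoming events listing the speaker."""
        now = now or datetime.now()
        hits = [(e["start"], e["title"]) for e in events
                if speaker_name in e.get("attendees", []) and e["start"] >= now]
        return sorted(hits)

    # events = [{"start": datetime(2030, 7, 1, 10), "title": "Project review",
    #            "attendees": ["Bill", "user"]}]
    # appointments_with_speaker(events, "Bill")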
At block 3.2101, the process performs accessing a document store to find documents that reference the speaker. In some embodiments, documents that reference the speaker include those that are authored at least in part by the speaker, those that name or otherwise identify the speaker in a document body, or the like. Accessing the document store may include accessing a local or remote storage device/system, accessing a document management system, accessing a source control system, or the like.
At block 3.2201, the process performs performing voice identification based on the received data to identify the speaker. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints.
At block 3.2301, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple distinct speakers. In some embodiments, the process accesses voice prints associated with multiple speakers, and determines a best match against the speech signal.
At block 3.2401, the process performs processing voice messages from the multiple distinct speakers to generate voice print data for each of the multiple distinct speakers. Given a telephone voice message, the process may associate generated voice print data for the voice message with one or more (direct or indirect) identifiers corresponding with the message. For example, the message may have a sender telephone number associated with it, and the process can use that sender telephone number to do a reverse directory lookup (e.g., in a public directory, in a personal contact list) to determine the name of the voice message speaker.
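By way of non-limiting illustration, the following Python sketch enrolls voice prints from stored voice messages, labeling each print by reverse-looking-up the sender's telephone number in a personal contact list. The message structure, the contact list, and the make_voice_print helper (shown earlier) are illustrative assumptions.

    # Sketch: generating voice print data from voice messages, keyed by the name
    # obtained via a reverse directory lookup of the sender's number.

    def enroll_from_voicemail(voice_messages, contacts, make_voice_print):
        """voice_messages: [{"sender_number": str, "mfcc_frames": array}, ...]
        contacts: {phone_number: name}. Returns {name: voice_print}."""
        prints = {}
        for msg in voice_messages:
            name = contacts.get(msg["sender_number"])  # reverse directory lookup
            if name is None:
                continue  # unknown number; could fall back to a public directory
            prints[name] = make_voice_print(msg["mfcc_frames"])
        return prints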
At block 3.2501, the process performs processing telephone voice messages stored by a voice mail service. In some embodiments, the process analyzes voice messages to generate voice prints/models for multiple speakers.
At block 3.2601, the process performs performing speech recognition to convert the received data into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by the speaker.
At block 3.2602, the process performs identifying the speaker based on the text data. Given text data (e.g., words spoken by the speaker), the process may search for information items that include the text data, and then identify the speaker based on those information items, as discussed further below.
At block 3.2701, the process performs finding a document that references the speaker and that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item that includes words spoken by the speaker. Then, the process can infer that the speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
At block 3.2801, the process performs performing speech recognition based on cepstral coefficients that represent the speech signal. In other embodiments, other types of features or information may be also or instead used to perform speech recognition, including language models, dialect models, or the like.
At block 3.2901, the process performs performing hidden Markov model-based speech recognition. Other approaches or techniques for speech recognition may include neural networks, stochastic modeling, or the like.
At block 3.3001, the process performs retrieving information items that reference the text data. The process may here retrieve or otherwise obtain documents, calendar events, messages, or the like, that include, contain, or otherwise reference some portion of the text data.
At block 3.3002, the process performs informing the user of the retrieved information items.
At block 3.3101, the process performs converting the text data into audio data that represents a voice of a different speaker. In some embodiments, the process may perform this conversion by performing text-to-speech processing to read the text data in a different voice.
At block 3.3102, the process performs causing the audio data to be played through the hearing device.
At block 3.3201, the process performs performing speech recognition based at least in part on a language model associated with the speaker. A language model may be used to improve or enhance speech recognition. For example, the language model may represent word transition likelihoods (e.g., by way of n-grams) that can be advantageously employed to enhance speech recognition. Furthermore, such a language model may be speaker specific, in that it may be based on communications or other information generated by the speaker.
At block 3.3301, the process performs generating the language model based on communications generated by the speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like to generate a language model that is specific or otherwise tailored to the speaker.
At block 3.3401, the process performs generating the language model based on emails transmitted by the speaker.
At block 3.3501, the process performs generating the language model based on documents authored by the speaker.
At block 3.3601, the process performs generating the language model based on social network messages transmitted by the speaker.
At block 3.3701, the process performs receiving data representing a speech signal that represents an utterance of the user. A microphone on or about the hearing device may capture this data. The microphone may be the same or different from one used to capture speech data from the speaker.
At block 3.3702, the process performs identifying the speaker based on the data representing a speech signal that represents an utterance of the user. Identifying the speaker in this manner may include performing speech recognition on the user's utterance, and then processing the resulting text data to locate a name. This identification can then be utilized to retrieve information items or other speaker-related information that may be useful to present to the user.
At block 3.3801, the process performs determining whether the utterance of the user includes a name of the speaker.
At block 3.3901, the process performs receiving context information related to the user. Context information may generally include information about the setting, location, occupation, communication, workflow, or other event or factor that is present at, about, or with respect to the user.
At block 3.3902, the process performs identifying the speaker, based on the context information. Context information may be used to improve or enhance speaker identification, such as by determining or narrowing a set of potential speakers based on the current location of the user.
At block 3.4001, the process performs receiving an indication of a location of the user.
At block 3.4002, the process performs determining a plurality of persons with whom the user commonly interacts at the location. For example, if the indicated location is a workplace, the process may generate a list of co-workers, thereby reducing or simplifying the problem of speaker identification.
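By way of non-limiting illustration, the following Python sketch uses a location label (e.g., derived from GPS coordinates or a network identifier) to select the people the user most commonly interacts with at that location, thereby narrowing the set of candidate speakers. The interaction log structure is an illustrative assumption.

    # Sketch: narrowing candidate speakers based on the user's current location.
    from collections import Counter

    def frequent_contacts_at(location, interaction_log, top_n=10):
        """interaction_log: [(location_label, person_name), ...].
        Returns the people most often encountered at `location`."""
        counts = Counter(person for loc, person in interaction_log if loc == location)
        return [person for person, _ in counts.most_common(top_n)]

    # candidates = frequent_contacts_at("workplace",
    #                                   [("workplace", "Bill"), ("home", "Alice")])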
At block 3.4101, the process performs receiving a GPS location from a mobile device of the user.
At block 3.4201, the process performs receiving a network identifier that is associated with the location. The network identifier may be, for example, a service set identifier (“SSID”) of a wireless network with which the user is currently associated.
At block 3.4301, the process performs receiving an indication that the user is at a workplace. For example, the process may translate a coordinate-based location (e.g., GPS coordinates) to a particular workplace by performing a map lookup or other mechanism.
At block 3.4401, the process performs receiving an indication that the user is at a residence.
At block 3.4501, the process performs receiving information about a communication that references the speaker. As noted, context information may include communications. In this case, the process may exploit such communications to improve speaker identification or other operations.
At block 3.4601, the process performs receiving information about a message that references the speaker.
At block 3.4701, the process performs receiving information about a document that references the speaker.
At block 3.4801, the process performs receiving data representing an ongoing conversation amongst multiple speakers. In some embodiments, the process is operable to identify multiple distinct speakers, such as when a group is meeting via a conference call.
At block 3.4802, the process performs identifying the multiple speakers based on the data representing the ongoing conversation.
At block 3.4803, the process performs as each of the multiple speakers takes a turn speaking during the ongoing conversation, informing the user of a name or other speaker-related information associated with the speaker. In this manner, the process may, in substantially real time, provide the user with indications of a current speaker, even though such a speaker may not be visible or even previously known to the user.
At block 3.4901, the process performs developing a corpus of speaker data by recording speech from a plurality of speakers.
At block 3.4902, the process performs identifying the speaker based at least in part on the corpus of speaker data. Over time, the process may gather and record speech obtained during its operation, and then use that speech as part of a corpus that is used during future operation. In this manner, the process may improve its performance by utilizing actual, environmental speech data, possibly along with feedback received from the user, as discussed below.
At block 3.5001, the process performs generating a speech model associated with each of the plurality of speakers, based on the recorded speech. The generated speech model may include voice print data that can be used for speaker identification, a language model that may be used for speech recognition purposes, and/or a noise model that may be used to improve operation in speaker-specific noisy environments.
At block 3.5101, the process performs receiving feedback regarding accuracy of the speaker-related information. During or after providing speaker-related information to the user, the user may provide feedback regarding its accuracy. This feedback may then be used to train a speech processor (e.g., a speaker identification module, a speech recognition module). Feedback may be provided in various ways, such as by processing positive/negative utterances from the speaker (e.g., "That is not my name"), receiving a positive/negative utterance from the user (e.g., "I am sorry."), or receiving a keyboard/button event that indicates a correct or incorrect identification.
At block 3.5102, the process performs training a speech processor based at least in part on the received feedback.
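By way of non-limiting illustration, the following Python sketch folds confirmed feedback back into a stored voice print by averaging in the new utterance, while feedback indicating an incorrect identification leaves the prints unchanged. Real systems might instead retrain a classifier; the structures shown are illustrative assumptions.

    # Sketch: using feedback about an identification to adjust stored voice prints.
    import numpy as np

    def apply_feedback(stored_prints, utterance_print, guessed_name, correct, weight=0.1):
        """If the guess was confirmed, nudge that speaker's voice print toward the
        new utterance; otherwise leave the stored prints unchanged."""
        if correct and guessed_name in stored_prints:
            old = stored_prints[guessed_name]
            stored_prints[guessed_name] = (1 - weight) * old + weight * utterance_print
        return stored_prints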
At block 3.5201, the process performs transmitting the speaker-related information to a hearing device configured to amplify speech for the user. In some embodiments, the hearing device may be a hearing aid or similar device that is configured to amplify or otherwise modulate audio signals for the user.
At block 3.5301, the process performs transmitting the speaker-related information to the hearing device from a computing system that is remote from the hearing device. In some embodiments, at least some of the processing is performed remote from the hearing device, such that the speaker-related information is transmitted to the hearing device.
At block 3.5401, the process performs transmitting the speaker-related information from a mobile device that is operated by the user and that is in communication with the hearing device. For example, the hearing device may be a headset or earpiece that communicates with a mobile device (e.g., smart phone) operated by the user.
At block 3.5501, the process performs wirelessly transmitting the speaker-related information from the mobile device to the hearing device. Various protocols may be used, including Bluetooth, infrared, WiFi, or the like.
At block 3.5601, the process performs transmitting the speaker-related information from a smart phone to the hearing device.
At block 3.5701, the process performs transmitting the speaker-related information from a portable media player to the hearing device.
At block 3.5801, the process performs transmitting the speaker-related information from a server system. In some embodiments, some portion of the processing is performed on a server system that may be remote from the hearing device.
At block 3.5901, the process performs transmitting the speaker-related information from a server system that resides in a data center.
At block 3.6001, the process performs transmitting the speaker-related information to earphones in communication with a mobile device that is operating as the hearing device.
At block 3.6101, the process performs transmitting the speaker-related information to earbuds in communication with a mobile device that is operating as the hearing device.
At block 3.6201, the process performs transmitting the speaker-related information to a headset in communication with a mobile device that is operating as the hearing device.
At block 3.6301, the process performs transmitting the speaker-related information to a pillow speaker in communication with a mobile device that is operating as the hearing device.
At block 3.6401, the process performs identifying the speaker, performed on a mobile device that is operated by the user. As noted, in some embodiments a mobile device such as a smart phone may have sufficient processing power to perform a portion of the process, such as identifying the speaker.
At block 3.6501, the process performs identifying the speaker, performed on a smart phone that is operated by the user.
At block 3.6601, the process performs identifying the speaker, performed on a media device that is operated by the user.
At block 3.6701, the process performs determining speaker-related information, performed on a mobile device that is operated by the user.
At block 3.6801, the process performs determining speaker-related information, performed on a smart phone that is operated by the user.
At block 3.6901, the process performs determining speaker-related information, performed on a media device that is operated by the user.
At block 3.7001, the process performs determining whether or not the user can name the speaker.
At block 3.7002, the process performs when it is determined that the user cannot name the speaker, informing the user of the speaker-related information via the hearing device. In some embodiments, the process only informs the user of the speaker-related information upon determining that the user does not appear to be able to name the speaker.
At block 3.7101, the process performs determining whether the user has named the speaker. In some embodiments, the process listens to the user to determine whether the user has named the speaker.
At block 3.7201, the process performs determining whether the user has uttered a given name or surname of the speaker.
At block 3.7301, the process performs determining whether the user has uttered a nickname of the speaker.
At block 3.7401, the process performs determining whether the user has uttered a name of a relationship between the user and the speaker. In some embodiments, the user need not utter the name of the speaker, but instead may utter other information (e.g., a relationship) that may be used by the process to determine that the user knows or can name the speaker.
At block 3.7501, the process performs determining whether the user has uttered information that is related to both the speaker and the user.
At block 3.7601, the process performs determining whether the user has named a person, place, thing, or event that the speaker and the user have in common. For example, the user may mention a visit to the home town of the speaker, a vacation to a place familiar to the speaker, or the like.
At block 3.7701, the process performs performing speech recognition to convert an utterance of the user into text data.
At block 3.7702, the process performs determining whether or not the user can name the speaker based at least in part on the text data.
At block 3.7801, the process performs when the user does not name the speaker within a predetermined time interval, determining that the user cannot name the speaker. In some embodiments, the process waits for a time period before jumping in to provide the speaker-related information.
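By way of non-limiting illustration, the following Python sketch waits for a predetermined interval and presents the speaker's name only if the user has not named the speaker in the meantime. The user_named_speaker callable stands in for speech recognition of the user's own utterances and, like the simple polling loop, is an illustrative assumption.

    # Sketch: provide the speaker-related information only after the user fails
    # to name the speaker within a predetermined time interval.
    import time

    def inform_if_user_silent(speaker_name, user_named_speaker, speak, wait_seconds=5.0):
        """speak: a callable such as speak_to_user shown earlier."""
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if user_named_speaker():
                return  # user recalled the name; stay silent
            time.sleep(0.25)
        speak("That's " + speaker_name)

    # inform_if_user_silent("Bill", lambda: False, speak_to_user)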
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AAFS 100. In addition, the computing system 400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AAFS 100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 400 comprises a computer memory ("memory") 401, a display 402, one or more Central Processing Units ("CPU") 403, Input/Output devices 404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 405, and network connections 406. The AAFS 100 is shown residing in memory 401. In other embodiments, some portion of the contents and/or some or all of the components of the AAFS 100 may be stored on and/or transmitted over the other computer-readable media 405. The components of the AAFS 100 preferably execute on one or more CPUs 403 and perform the audible assistance functions described herein. Other code or programs 430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 420, also reside in the memory 401, and preferably execute on one or more CPUs 403. Of note, one or more of the components in
The AAFS 100 interacts via the network 450 with hearing devices 120, speaker-related information sources 130, and third-party systems/applications 455. The network 450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 455 may include any systems that provide data to, or utilize data from, the AAFS 100, including Web browsers, e-commerce sites, calendar applications, email systems, social networking services, and the like.
The AAFS 100 is shown executing in the memory 401 of the computing system 400. Also included in the memory are a user interface manager 415 and an application program interface (“API”) 416. The user interface manager 415 and the API 416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AAFS 100.
The UI manager 415 provides a view and a controller that facilitate user interaction with the AAFS 100 and its various components. For example, the UI manager 415 may provide interactive access to the AAFS 100, such that users can configure the operation of the AAFS 100, such as by providing the AAFS 100 credentials to access various sources of speaker-related information, including social networking services, email systems, document stores, or the like. In some embodiments, access to the functionality of the UI manager 415 may be provided via a Web server, possibly executing as one of the other programs 430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 455 can interact with the AAFS 100 via the UI manager 415.
The API 416 provides programmatic access to one or more functions of the AAFS 100. For example, the API 416 may provide a programmatic interface to one or more functions of the AAFS 100 that may be invoked by one of the other programs 430 or some other module. In this manner, the API 416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AAFS 100 into Web applications), and the like.
In addition, the API 416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the hearing devices 120, information sources 130, and/or one of the third-party systems/applications 455, to access various functions of the AAFS 100. For example, an information source 130 may push speaker-related information (e.g., emails, documents, calendar events) to the AAFS 100 via the API 416. The API 416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 455 and that are configured to interact with the AAFS 100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AAFS 100 are implemented using standard programming techniques. For example, the AAFS 100 may be implemented as a “native” executable running on the CPU 403, along with one or more static or dynamic libraries. In other embodiments, the AAFS 100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AAFS 100, such as in the data store 417, can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through scripting languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 417 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner including but not limited to TCP/IP sockets, RPC, RMI, HTTP, Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AAFS 100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, determining and presenting speaker-related information based on speaker utterances received by, for example, a hearing device. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). The AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory), and/or other abilities of a user, such as by assisting a user with the recall of names, events, communications, documents, or other information related to a speaker with whom the user is conversing. For example, when the user engages a speaker in conversation, the AEFS may “listen” to the speaker in order to identify the speaker and/or determine other speaker-related information, such as events or communications relating to the speaker and/or the user. Then, the AEFS may inform the user of the determined information, such as by visually presenting the information on a display screen or other visual output device. The user can then read the information provided by the AEFS and advantageously use that information to avoid embarrassment (e.g., due to an inability to recall the speaker's name), engage in a more productive conversation (e.g., by quickly accessing information about events, deadlines, or communications related to the speaker), or the like.
In some embodiments, the AEFS is configured to receive data that represents an utterance of a speaker and that is obtained at or about a hearing device associated with a user. The hearing device may be or include any device that is used by the user to hear sounds, including a hearing aid, a personal media device/player, a telephone, or the like. The AEFS may then identify the speaker based at least in part on the received data, such as by performing speaker recognition and/or speech recognition with the received data. The AEFS may then determine speaker-related information associated with the identified speaker, such as an identifier (e.g., name or title) of the speaker, an information item (e.g., a document, event, communication) that references the speaker, or the like. Then, the AEFS may inform the user of the determined speaker-related information by, for example, visually presenting the speaker-related information via a visual display device. In some embodiments, the visual display device may be part of the hearing device, such as a screen on a personal media player. In some embodiments, the visual display device may be separate from the hearing device. For example, the visual display device may be a screen on a laptop computer whilst the hearing device is a hearing aid worn by the user.
In the scenario illustrated in
The hearing device 5.120 receives a speech signal that represents the utterance 5.110, such as by receiving a digital representation of an audio signal received by a microphone of the hearing device 5.120. The hearing device 5.120 then transmits data representing the speech signal to the AEFS 5.100. Transmitting the data representing the speech signal may include transmitting audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal.
The AEFS 5.100 then identifies the speaker based on the received data representing the speech signal. In some embodiments, identifying the speaker may include performing speaker recognition, such as by generating a “voice print” from the received data and comparing the generated voice print to previously obtained voice prints. For example, the generated voice print may be compared to multiple voice prints that are stored as audio data 5.130c and that each correspond to a speaker, in order to determine a speaker who has a voice that most closely matches the voice of the speaker 5.102. The voice prints stored as audio data 5.130c may be generated based on various sources of data, including data corresponding to speakers previously identified by the AEFS 5.100, voice mail messages, speaker enrollment data, or the like.
In some embodiments, identifying the speaker may include performing speech recognition, such as by automatically converting the received data representing the speech signal into text. The text of the speaker's utterance may then be used to identify the speaker. In particular, the text may identify one or more entities such as information items (e.g., communications, documents), events (e.g., meetings, deadlines), persons, or the like, that may be used by the AEFS 5.100 to identify the speaker. The information items may be accessed with reference to the messages 5.130a and/or documents 5.130b. As one example, the speaker's utterance 5.110 may identify an email message that was sent to the speaker 5.102 and the user 5.104 (e.g., “That sure was a nasty email Bob sent us”). As another example, the speaker's utterance 5.110 may identify a meeting or other event to which both the speaker 5.102 and the user 5.104 are invited.
Note that in some cases, the text of the speaker's utterance 5.110 may not definitively identify the speaker 5.102, such as when a communication was sent to one or more recipients in addition to the speaker 5.102 and the user 5.104. However, in such cases the text may still be used by the AEFS 5.100 to narrow the set of potential speakers, and may be combined with (or used to improve) other techniques for speaker identification, including speaker recognition as discussed above.
The AEFS 5.100 then determines speaker-related information associated with the speaker 5.102. The speaker-related information may be a name or other identifier of the speaker. The speaker-related information may also or instead be other information about or related to the speaker, such as an organization of the speaker, an information item that references the speaker, an event involving the speaker, or the like. The speaker-related information may be determined with reference to the messages 5.130a, documents 5.130b, and/or audio data 5.130c. For example, having determined the identity of the speaker 5.102, the AEFS 5.100 may search for emails and/or documents that are stored as messages 5.130a and/or documents 5.130b and that reference (e.g., are sent to, are authored by, are named in) the speaker 5.102.
Other types of speaker-related information are contemplated, including social networking information, such as personal or professional relationship graphs represented by a social networking service, messages or status updates sent within a social network, or the like. Social networking information may also be derived from other sources, including email lists, contact lists, communication patterns (e.g., frequent recipients of emails), or the like.
The AEFS 5.100 then informs the user 5.104 of the determined speaker-related information. Informing the user may include visually presenting the information, such as on the display 5.121 of the hearing device 5.120. In the illustrated example, the AEFS 5.100 causes a message 5.112 that includes the text “That's Bill” to be displayed on the display 5.121. Upon reading the message 5.112 and thereby learning the identity of the speaker 5.102, the user 5.104 responds to the speaker's original utterance 5.110 with a response utterance 5.114 by speaking the words “Hi Bill!” As the speaker 5.102 and the user 5.104 continue to speak, the AEFS 5.100 may monitor the conversation and continue to determine and present speaker-related information to the user 5.104.
The AEFS 5.100 may cause speaker-related information to be displayed in various ways or places. In some embodiments, the AEFS 5.100 may use a display of a hearing device as a target for displaying speaker-related information. For example, the AEFS 5.100 may display speaker-related information on the display 5.121 of the smart phone 5.120a. When the hearing device does not have its own display, such as hearing aid device 5.120b, the AEFS 5.100 may display speaker-related information on some other destination display that is accessible to the user 5.104. For example, when the hearing aid device 5.120b is the hearing device and the user also has the personal media player 5.120c in his possession, the AEFS 5.100 may elect to display speaker-related information upon the display 5.123 of the personal media player 5.120c.
The AEFS 5.100 may determine a destination display for speaker-related information. In some embodiments, determining a destination display may include selecting from one of multiple possible destination displays based on whether a display is capable of displaying all of the speaker-related information. For example, if the user 5.104 is proximate to a first display that is capable of displaying only text and a second display capable of displaying graphics, the AEFS 5.100 may select the second display when the speaker-related information includes graphics content (e.g., an image). In some embodiments, determining a destination display may include selecting from one of multiple possible destination displays based on the size of each display. For example, a small LCD display (such as may be found on a mobile phone) may be suitable for displaying speaker-related information that is just a few characters (e.g., a name) but not be suitable for displaying an entire email message or large document. Note that the AEFS 5.100 may select between multiple potential target displays even when the hearing device itself includes its own display.
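The following non-limiting sketch illustrates one possible destination-display selection policy of the kind described above, based on graphics capability and character capacity; the display descriptors and the preference for the smallest sufficient display are assumptions made only for illustration.

```python
# Hypothetical sketch of destination-display selection as described above.
# Display descriptors (capabilities, character capacity) are assumptions,
# not an actual device API.
def choose_display(displays, info):
    """displays: list of dicts like {'name': ..., 'graphics': bool, 'chars': int}.
    info: dict like {'needs_graphics': bool, 'length': int}."""
    def suitable(d):
        if info.get("needs_graphics") and not d["graphics"]:
            return False
        return d["chars"] >= info.get("length", 0)

    candidates = [d for d in displays if suitable(d)]
    # Illustrative policy: prefer the smallest display that can show everything.
    return min(candidates, key=lambda d: d["chars"]) if candidates else None

displays = [
    {"name": "office phone LCD", "graphics": False, "chars": 32},
    {"name": "desktop monitor", "graphics": True, "chars": 10000},
]
print(choose_display(displays, {"needs_graphics": False, "length": 12})["name"])
```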
Determining a destination display may be based on other or additional factors. In some embodiments, the AEFS 5.100 may use user preferences that have been inferred (e.g., based on current or prior interactions with the user 5.104) and/or explicitly provided by the user. For example, the AEFS 5.100 may determine to present an email or other speaker-related information onto the display 5.121 of the smart phone 5.120a based on the fact that the user 5.104 is currently interacting with the smart phone 5.120a.
In some embodiments, the AEFS 5.100 may also use audio signals to interact with the user 5.104. In particular, each of the illustrated hearing devices 5.120 may include or be communicatively coupled to a speaker operable to generate and output audio signals that may be perceived by the user 5.104. The AEFS 5.100 may audibly notify, via a speaker of a hearing device 5.120, the user 5.104 to view speaker-related information displayed on the hearing device 5.120. For example, the AEFS 5.100 may cause a tone (e.g., beep, chime) to be played via the earphones 5.124 of the personal media player hearing device 5.120c. Such a tone may then be recognized by the user 5.104, who will in response attend to information displayed on the display 5.123. Such audible notification may be used to identify a display that is being used as a current display, such as when multiple displays are being used. For example, different first and second tones may be used to direct the user's attention to a desktop display and a smart phone display, respectively. In some embodiments, audible notification may include playing synthesized speech (e.g., from text-to-speech processing) telling the user 5.104 to view speaker-related information on a particular display device (e.g., “Recent email on your smart phone”).
Note that although the AEFS 5.100 is shown as being separate from a hearing device 5.120, some or all of the functions of the AEFS 5.100 may be performed within or by the hearing device 5.120 itself. For example, the smart phone hearing device 5.120a and/or the media player hearing device 5.120c may have sufficient processing power to perform all or some functions of the AEFS 5.100, including speaker identification (e.g., speaker recognition, speech recognition), determining speaker-related information, presenting the determined information, or the like. In some embodiments, the hearing device 5.120 includes logic to determine where to perform various processing tasks, so as to advantageously distribute processing between available resources, including that of the hearing device 5.120, other nearby devices (e.g., a laptop or other computing device of the user 5.104 and/or the speaker 5.102), remote devices (e.g., “cloud-based” processing and/or storage), and the like.
Other types of hearing devices are contemplated. For example, a land-line telephone may be configured to operate as a hearing device, so that the AEFS 5.100 can determine speaker-related information about speakers who are engaged in a conference call. As another example, a hearing device may be or be part of a desktop computer, laptop computer, PDA, tablet computer, or the like.
The speech and language engine 6.210 includes a speech recognizer 6.212, a speaker recognizer 6.214, and a natural language processor 6.216. The speech recognizer 6.212 transforms speech audio data received from the hearing device 5.120 into a textual representation of an utterance represented by the speech audio data. In some embodiments, the performance of the speech recognizer 6.212 may be improved or augmented by use of a language model (e.g., representing likelihoods of transitions between words, such as based on n-grams) or speech model (e.g., representing acoustic properties of a speaker's voice) that is tailored to or based on an identified speaker. For example, once a speaker has been identified, the speech recognizer 6.212 may use a language model that was previously generated based on a corpus of communications and other information items authored by the identified speaker. A speaker-specific language model may be generated based on a corpus of documents and/or messages authored by a speaker. Speaker-specific speech models may be used to account for accents or channel properties (e.g., due to environmental factors or communication equipment) that are specific to a particular speaker, and may be generated based on a corpus of recorded speech from the speaker.
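By way of illustration only, a speaker-specific language model of the kind the speech recognizer 6.212 might consult could be approximated by a simple bigram model built from a corpus of the speaker's messages, as in the following sketch; the tokenization and the absence of smoothing are deliberate simplifications.

```python
# Hypothetical sketch: build a speaker-specific bigram language model from a
# corpus of that speaker's messages. Real systems would add smoothing and
# proper tokenization; this is illustrative only.
from collections import Counter, defaultdict

def build_bigram_model(corpus_texts):
    counts = defaultdict(Counter)
    for text in corpus_texts:
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return {
        prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
        for prev, nexts in counts.items()
    }

def transition_probability(model, prev, curr):
    return model.get(prev.lower(), {}).get(curr.lower(), 0.0)

speaker_emails = ["the quarterly report is late", "the report looks fine"]
model = build_bigram_model(speaker_emails)
print(transition_probability(model, "the", "report"))  # 0.5 in this toy corpus
```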
The speaker recognizer 6.214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the hearing device 5.120. The speaker recognizer 6.214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 6.240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data.
The natural language processor 6.216 processes text generated by the speech recognizer 6.212 and/or located in information items obtained from the speaker-related information sources 5.130. In doing so, the natural language processor 6.216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification and/or other functions of the AEFS 5.100. For example, the natural language processor 6.216 may process status updates posted by the user 5.104 on a social networking service, to determine that the user 5.104 recently attended a conference in a particular city, and this fact may be used to identify a speaker and/or determine other speaker-related information.
The agent logic 6.220 implements the core intelligence of the AEFS 5.100. The agent logic 6.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to identify speakers and/or determine speaker-related information. For example, the agent logic 6.220 may combine spoken text from the speech recognizer 6.212, a set of potentially matching speakers from the speaker recognizer 6.214, and information items from the information sources 5.130, in order to determine the most likely identity of the current speaker.
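The following sketch illustrates, in simplified form, how agent logic might fuse acoustic scores, text-derived mentions, and information-item hits into a single ranking; the weighting scheme and score scales are illustrative assumptions rather than features of any particular embodiment.

```python
# Hypothetical sketch of evidence fusion by agent logic. The weights and
# score scales are illustrative assumptions only.
def rank_speakers(acoustic_scores, text_mentions, info_item_hits,
                  w_acoustic=0.6, w_text=0.25, w_items=0.15):
    """Each argument maps speaker_id -> a score in [0, 1]."""
    speakers = set(acoustic_scores) | set(text_mentions) | set(info_item_hits)
    combined = {
        s: (w_acoustic * acoustic_scores.get(s, 0.0)
            + w_text * text_mentions.get(s, 0.0)
            + w_items * info_item_hits.get(s, 0.0))
        for s in speakers
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_speakers(
    acoustic_scores={"bill": 0.8, "carol": 0.6},
    text_mentions={"bill": 1.0},            # utterance referenced an email to Bill
    info_item_hits={"bill": 0.5, "dave": 0.9},
)
print(ranking[0])  # ('bill', ...) is the most likely current speaker
```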
The presentation engine 6.230 includes a visible output processor 6.232 and an audible output processor 6.234. The visible output processor 6.232 may prepare, format, and/or cause speaker-related information to be displayed on a display device, such as a display of the hearing device 5.120 or some other display (e.g., a desktop or laptop display in proximity to the user 5.104). The agent logic 6.220 may use or invoke the visible output processor 6.232 to prepare and display speaker-related information, such as by formatting or otherwise modifying the speaker-related information to fit on a particular type or size of display. The audible output processor 6.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 6.220 may use or invoke the audible output processor 6.234 in order to convert textual speaker-related information into audio output suitable for presentation via the hearing device 5.120, for example by employing a text-to-speech processor.
Note that although speaker identification is herein sometimes described as including the positive identification of a single speaker, it may instead or also include determining likelihoods that each of one or more persons is the current speaker. For example, the speaker recognizer 6.214 may provide to the agent logic 6.220 indications of multiple candidate speakers, each having a corresponding likelihood. The agent logic 6.220 may then select the most likely candidate based on the likelihoods alone or in combination with other information, such as that provided by the speech recognizer 6.212, natural language processor 6.216, speaker-related information sources 5.130, or the like. In some cases, such as when there are a small number of reasonably likely candidate speakers, the agent logic 6.220 may inform the user 5.104 of the identities of all of the candidate speakers (as opposed to a single candidate speaker), as such information may be sufficient to trigger the user's recall.
FIGS. 7.1-7.81 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 7.101, the process performs receiving data representing a speech signal obtained at a hearing device associated with a user, the speech signal representing an utterance of a speaker. The received data may be or represent the speech signal itself (e.g., audio samples) and/or higher-order information (e.g., frequency coefficients). The data may be received by or at the hearing device 5.120 and/or the AEFS 5.100.
At block 7.102, the process performs identifying the speaker based on the data representing the speech signal. Identifying the speaker may be based on signal properties of the speech signal (e.g., a voice print) and/or on the content of the utterance, such as a name, event, entity, or information item that was mentioned by the speaker and that can be used to infer the identity of the speaker.
At block 7.103, the process performs determining speaker-related information associated with the identified speaker. The speaker-related information may include identifiers of the speaker (e.g., names, titles) and/or related information, including information items that reference the speaker, such as documents, emails, calendar events, or the like.
At block 7.104, the process performs visually presenting the speaker-related information to the user. The speaker-related information may be presented on a display of the hearing device (if it has one) or on some other display, such as a laptop or desktop display that is proximately located to the user.
At block 7.201, the process performs presenting the speaker-related information on a display of the hearing device. In some embodiments, the hearing device may include a display. For example, where the hearing device is a smart phone or media player/device, the hearing device may include a display that provides a suitable medium for presenting the name or other identifier of the speaker.
At block 7.301, the process performs presenting the speaker-related information on a display of a computing device that is distinct from the hearing device. In some embodiments, the hearing device may not itself include a display. For example, where the hearing device is an office phone, the process may elect to present the speaker-related information on a display of a nearby computing device, such as a desktop or laptop computer in the vicinity of the phone.
At block 7.401, the process performs determining a display to serve as a destination for the speaker-related information. In some embodiments, there may be multiple displays available as possible destinations for the speaker-related information. For example, in an office setting, where the hearing device is an office phone, the office phone may include a small LCD display suitable for displaying a few characters or at most a few lines of text. However, there will typically be additional devices in the vicinity of the hearing device, such as a desktop/laptop computer, a smart phone, a PDA, or the like. The process may determine to use one or more of these other display devices, possibly based on the type of the speaker-related information being displayed.
At block 7.501, the process performs selecting from one of multiple displays, based at least in part on whether each of the multiple displays is capable of displaying all of the speaker-related information. In some embodiments, the process determines whether all of the speaker-related information can be displayed on a given display. For example, where the display is a small alphanumeric display on an office phone, the process may determine that the display is not capable of displaying a large amount of speaker-related information.
At block 7.601, the process performs selecting from one of multiple displays, based at least in part on a size of each of the multiple displays. In some embodiments, the process considers the size (e.g., the number of characters or pixels that can be displayed) of each display.
At block 7.701, the process performs selecting from one of multiple displays, based at least in part on whether each of the multiple displays is suitable for displaying the speaker-related information, the speaker-related information being at least one of text information, a communication, a document, an image, and/or a calendar event. In some embodiments, the process considers the type of the speaker-related information. For example, whereas a small alphanumeric display on an office phone may be suitable for displaying the name of the speaker, it would not be suitable for displaying an email message sent by the speaker.
At block 7.801, the process performs audibly notifying the user to view the speaker-related information on a display device.
At block 7.901, the process performs playing a tone via an audio speaker of the hearing device. The tone may include a beep, chime, or other type of notification.
At block 7.1001, the process performs playing synthesized speech via an audio speaker of the hearing device, the synthesized speech telling the user to view the display device. In some embodiments, the process may perform text-to-speech processing to generate audio of a textual message or notification, and this audio may then be played or otherwise output to the user via the hearing device.
At block 7.1101, the process performs telling the user that at least one of a document, a calendar event, and/or a communication is available for viewing on the display device. Telling the user about a document or other speaker-related information may include playing synthesized speech that includes an utterance to that effect.
At block 7.1201, the process performs audibly notifying the user in a manner that is not audible to the speaker. For example, a tone or verbal message may be output via an earpiece speaker, such that other parties to the conversation (including the speaker) do not hear the notification. As another example, a tone or other notification may be output into the earpiece of a telephone, such as when the process is performing its functions within the context of a telephonic conference call.
At block 7.1301, the process performs informing the user of an identifier of the speaker. In some embodiments, the identifier of the speaker may be or include a given name, surname (e.g., last name, family name), nickname, title, job description, or other type of identifier of or associated with the speaker.
At block 7.1401, the process performs informing the user of information aside from identifying information related to the speaker. In some embodiments, information aside from identifying information may include information that is not a name or other identifier (e.g., job title) associated with the speaker. For example, the process may tell the user about an event or communication associated with or related to the speaker.
At block 7.1501, the process performs informing the user of an organization to which the speaker belongs. In some embodiments, informing the user of an organization may include notifying the user of a business, group, school, club, team, company, or other formal or informal organization with which the speaker is affiliated.
At block 7.1601, the process performs informing the user of a company associated with the speaker. Companies may include profit or non-profit entities, regardless of organizational structure (e.g., corporation, partnerships, sole proprietorship).
At block 7.1701, the process performs informing the user of a previously transmitted communication referencing the speaker. Various forms of communication are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, a communication can include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 7.1801, the process performs informing the user of an email transmitted between the speaker and the user. An email transmitted between the speaker and the user may include an email sent from the speaker to the user, or vice versa.
At block 7.1901, the process performs informing the user of a text message transmitted between the speaker and the user. Text messages may include short messages according to various protocols, including SMS, MMS, and the like.
At block 7.2001, the process performs informing the user of an event involving the user and the speaker. An event may be any occurrence that involves or involved the user and the speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the speaker, an upcoming deadline (e.g., for a project), or the like.
At block 7.2101, the process performs informing the user of a previously occurring event and/or a future event.
At block 7.2201, the process performs informing the user of at least one of a project, a meeting, and/or a deadline.
At block 7.2301, the process performs accessing information items associated with the speaker. In some embodiments, accessing information items associated with the speaker may include retrieving files, documents, data records, or the like from various sources, such as local or remote storage devices, including cloud-based servers, and the like. In some embodiments, accessing information items may also or instead include scanning, searching, indexing, or otherwise processing information items to find ones that include, name, mention, or otherwise reference the speaker.
At block 7.2401, the process performs searching for information items that reference the speaker. In some embodiments, searching may include formulating a search query to provide to a document management system or any other data/document store that provides a search interface.
At block 7.2501, the process performs searching stored emails to find emails that reference the speaker. In some embodiments, emails that reference the speaker may include emails sent from the speaker, emails sent to the speaker, emails that name or otherwise identify the speaker in the body of an email, or the like.
At block 7.2601, the process performs searching stored text messages to find text messages that reference the speaker. In some embodiments, text messages that reference the speaker include messages sent to/from the speaker, messages that name or otherwise identify the speaker in a message body, or the like.
At block 7.2701, the process performs accessing a social networking service to find messages or status updates that reference the speaker. In some embodiments, accessing a social networking service may include searching for postings, status updates, personal messages, or the like that have been posted by, posted to, or otherwise reference the speaker. Example social networking services include Facebook, Twitter, Google Plus, and the like. Access to a social networking service may be obtained via an API or similar interface that provides access to social networking data related to the user and/or the speaker.
At block 7.2801, the process performs accessing a calendar to find information about appointments with the speaker. In some embodiments, accessing a calendar may include searching a private or shared calendar to locate a meeting or other appointment with the speaker, and providing such information to the user via the hearing device.
At block 7.2901, the process performs accessing a document store to find documents that reference the speaker. In some embodiments, documents that reference the speaker include those that are authored at least in part by the speaker, those that name or otherwise identify the speaker in a document body, or the like. Accessing the document store may include accessing a local or remote storage device/system, accessing a document management system, accessing a source control system, or the like.
At block 7.3001, the process performs performing voice identification based on the received data to identify the speaker. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints.
At block 7.3101, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple distinct speakers. In some embodiments, the process accesses voice prints associated with multiple speakers, and determines a best match against the speech signal.
At block 7.3201, the process performs processing voice messages from the multiple distinct speakers to generate voice print data for each of the multiple distinct speakers. Given a telephone voice message, the process may associate generated voice print data for the voice message with one or more (direct or indirect) identifiers corresponding with the message. For example, the message may have a sender telephone number associated with it, and the process can use that sender telephone number to do a reverse directory lookup (e.g., in a public directory, in a personal contact list) to determine the name of the voice message speaker.
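A non-limiting sketch of this enrollment idea follows: voice prints generated from stored voice messages are labeled by reverse-looking-up the sender telephone number in a contact directory. The feature extraction shown is a placeholder for real signal processing, and the directory and message structures are hypothetical.

```python
# Hypothetical sketch: label voice prints generated from stored voice messages
# by reverse-looking-up the sender telephone number in a contact list.
def extract_voice_print(audio_bytes):
    # Placeholder: a real system would compute e.g. cepstral statistics here.
    return [len(audio_bytes) % 7, len(audio_bytes) % 11]

def enroll_from_voicemail(voice_messages, contact_directory):
    """voice_messages: list of dicts like {'sender_number': ..., 'audio': bytes}.
    contact_directory: maps telephone number -> person name."""
    enrolled = {}
    for message in voice_messages:
        name = contact_directory.get(message["sender_number"])
        if name is None:
            continue  # unknown number: skip, or queue for a public-directory lookup
        enrolled.setdefault(name, []).append(extract_voice_print(message["audio"]))
    return enrolled

contacts = {"+1-555-0100": "Bill"}
messages = [{"sender_number": "+1-555-0100", "audio": b"...voicemail audio..."}]
print(enroll_from_voicemail(messages, contacts))  # {'Bill': [[...]]}
```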
At block 7.3301, the process performs processing telephone voice messages stored by a voice mail service. In some embodiments, the process analyzes voice messages to generate voice prints/models for multiple speakers.
At block 7.3401, the process performs performing speech recognition to convert the received data into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by the speaker.
At block 7.3402, the process performs identifying the speaker based on the text data. Given text data (e.g., words spoken by the speaker), the process may search for information items that include the text data, and then identify the speaker based on those information items, as discussed further below.
At block 7.3501, the process performs finding a document that references the speaker and that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item that includes words spoken by the speaker. Then, the process can infer that the speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
At block 7.3601, the process performs performing speech recognition based on cepstral coefficients that represent the speech signal. In other embodiments, other types of features or information may be also or instead used to perform speech recognition, including language models, dialect models, or the like.
At block 7.3701, the process performs performing hidden Markov model-based speech recognition. Other approaches or techniques for speech recognition may include neural networks, stochastic modeling, or the like.
At block 7.3801, the process performs retrieving information items that reference the text data. The process may here retrieve or otherwise obtain documents, calendar events, messages, or the like, that include, contain, or otherwise reference some portion of the text data.
At block 7.3802, the process performs informing the user of the retrieved information items.
At block 7.3901, the process performs converting the text data into audio data that represents a voice of a different speaker. In some embodiments, the process may perform this conversion by performing text-to-speech processing to read the text data in a different voice.
At block 7.3902, the process performs causing the audio data to be played through the hearing device.
At block 7.4001, the process performs performing speech recognition based at least in part on a language model associated with the speaker. A language model may be used to improve or enhance speech recognition. For example, the language model may represent word transition likelihoods (e.g., by way of n-grams) that can be advantageously employed to enhance speech recognition. Furthermore, such a language model may be speaker specific, in that it may be based on communications or other information generated by the speaker.
At block 7.4101, the process performs generating the language model based on communications generated by the speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like to generate a language model that is specific or otherwise tailored to the speaker.
At block 7.4201, the process performs generating the language model based on emails transmitted by the speaker.
At block 7.4301, the process performs generating the language model based on documents authored by the speaker.
At block 7.4401, the process performs generating the language model based on social network messages transmitted by the speaker.
At block 7.4501, the process performs receiving data representing a speech signal that represents an utterance of the user. A microphone on or about the hearing device may capture this data. The microphone may be the same or different from one used to capture speech data from the speaker.
At block 7.4502, the process performs identifying the speaker based on the data representing a speech signal that represents an utterance of the user. Identifying the speaker in this manner may include performing speech recognition on the user's utterance, and then processing the resulting text data to locate a name. This identification can then be utilized to retrieve information items or other speaker-related information that may be useful to present to the user.
At block 7.4601, the process performs determining whether the utterance of the user includes a name of the speaker.
At block 7.4701, the process performs receiving context information related to the user. Context information may generally include information about the setting, location, occupation, communication, workflow, or other event or factor that is present at, about, or with respect to the user.
At block 7.4702, the process performs identifying the speaker, based on the context information. Context information may be used to improve or enhance speaker identification, such as by determining or narrowing a set of potential speakers based on the current location of the user.
At block 7.4801, the process performs receiving an indication of a location of the user.
At block 7.4802, the process performs determining a plurality of persons with whom the user commonly interacts at the location. For example, if the indicated location is a workplace, the process may generate a list of co-workers, thereby reducing or simplifying the problem of speaker identification.
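As a purely illustrative sketch of blocks 7.4801-7.4802, the following pre-filters candidate speakers using a hypothetical mapping from locations to frequently encountered contacts; the mapping and the fallback behavior are assumptions, not part of any particular embodiment.

```python
# Hypothetical sketch: use the user's current location to pre-filter the set
# of candidate speakers before acoustic matching.
FREQUENT_CONTACTS_BY_LOCATION = {
    "workplace": {"bill", "carol", "dave"},
    "residence": {"family_member_1", "family_member_2"},
}

def filter_candidates_by_location(all_candidates, location):
    likely = FREQUENT_CONTACTS_BY_LOCATION.get(location)
    if not likely:
        return all_candidates  # unknown location: no narrowing possible
    narrowed = {s: score for s, score in all_candidates.items() if s in likely}
    return narrowed or all_candidates

candidates = {"bill": 0.55, "stranger_42": 0.50}
print(filter_candidates_by_location(candidates, "workplace"))  # {'bill': 0.55}
```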
At block 7.4901, the process performs receiving a GPS location from a mobile device of the user.
At block 7.5001, the process performs receiving a network identifier that is associated with the location. The network identifier may be, for example, a service set identifier (“SSID”) of a wireless network with which the user is currently associated.
At block 7.5101, the process performs receiving an indication that the user is at a workplace. For example, the process may translate a coordinate-based location (e.g., GPS coordinates) to a particular workplace by performing a map lookup or other mechanism.
At block 7.5201, the process performs receiving an indication that the user is at a residence.
At block 7.5301, the process performs receiving information about a communication that references the speaker. As noted, context information may include communications. In this case, the process may exploit such communications to improve speaker identification or other operations.
At block 7.5401, the process performs receiving information about a message and/or a document that references the speaker.
At block 7.5501, the process performs receiving data representing an ongoing conversation amongst multiple speakers. In some embodiments, the process is operable to identify multiple distinct speakers, such as when a group is meeting via a conference call.
At block 7.5502, the process performs identifying the multiple speakers based on the data representing the ongoing conversation.
At block 7.5503, the process performs as each of the multiple speakers takes a turn speaking during the ongoing conversation, informing the user of a name or other speaker-related information associated with the speaker. In this manner, the process may, in substantially real time, provide the user with indications of a current speaker, even though such a speaker may not be visible or even previously known to the user.
At block 7.5601, the process performs receiving audio data from a telephonic conference call, the received audio data representing utterances made by at least one of the multiple speakers.
At block 7.5701, the process performs presenting, while a current speaker is speaking, speaker-related information on a display device of the user, the displayed speaker-related information identifying the current speaker. For example, as the user engages in a conference call from his office, the process may present the name or other information about the current speaker on a display of a desktop computer in the office of the user.
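The conference-call flow of blocks 7.5501 through 7.5701 might be approximated, purely for illustration, by the loop sketched below, in which each speaking turn is identified and the current speaker's name is pushed to a display; the turn segmentation, identification routine, and display hook are placeholders rather than an actual conferencing API.

```python
# Hypothetical sketch of per-turn speaker identification on conference audio.
def identify_speaker(segment_audio, enrolled_prints):
    # Placeholder for the voice-print matching sketched earlier.
    return "Bill" if segment_audio else "Unknown speaker"

def monitor_conference(turn_segments, enrolled_prints, show_on_display):
    """turn_segments: an iterable of per-turn audio chunks, e.g. as produced
    by a diarization front end on conference-call audio."""
    for segment in turn_segments:
        name = identify_speaker(segment, enrolled_prints)
        show_on_display(f"Now speaking: {name}")

monitor_conference([b"turn-1 audio", b"turn-2 audio"], {}, print)
```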
At block 7.5801, the process performs developing a corpus of speaker data by recording speech from a plurality of speakers.
At block 7.5802, the process performs identifying the speaker based at least in part on the corpus of speaker data. Over time, the process may gather and record speech obtained during its operation, and then use that speech as part of a corpus that is used during future operation. In this manner, the process may improve its performance by utilizing actual, environmental speech data, possibly along with feedback received from the user, as discussed below.
At block 7.5901, the process performs generating a speech model associated with each of the plurality of speakers, based on the recorded speech. The generated speech model may include voice print data that can be used for speaker identification, a language model that may be used for speech recognition purposes, and/or a noise model that may be used to improve operation in speaker-specific noisy environments.
At block 7.6001, the process performs receiving feedback regarding accuracy of the speaker-related information. During or after providing speaker-related information to the user, the user may provide feedback regarding its accuracy. This feedback may then be used to train a speech processor (e.g., a speaker identification module, a speech recognition module). Feedback may be provided in various ways, such as by processing positive/negative utterances from the speaker (e.g., “That is not my name”), receiving a positive/negative utterance from the user (e.g., “I am sorry.”), or receiving a keyboard/button event that indicates a correct or incorrect identification.
At block 7.6002, the process performs training a speech processor based at least in part on the received feedback.
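By way of illustration, the following sketch shows one very simple way such feedback could be folded back into speaker identification: a per-speaker bias that is nudged by confirmations and corrections. A real system would more likely retrain acoustic or language models; the update rule and learning rate here are assumptions only.

```python
# Hypothetical sketch of feedback handling per blocks 7.6001-7.6002.
from collections import defaultdict

class FeedbackAdjustedIdentifier:
    def __init__(self, learning_rate=0.1):
        self.bias = defaultdict(float)
        self.learning_rate = learning_rate

    def adjusted_score(self, speaker_id, raw_score):
        return raw_score + self.bias[speaker_id]

    def record_feedback(self, speaker_id, was_correct):
        delta = self.learning_rate if was_correct else -self.learning_rate
        self.bias[speaker_id] += delta

identifier = FeedbackAdjustedIdentifier()
identifier.record_feedback("bill", was_correct=False)   # user indicated a misidentification
print(identifier.adjusted_score("bill", 0.70))           # approximately 0.6 after the correction
```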
At block 7.6101, the process performs transmitting the speaker-related information from a first device to a second device having a display. In some embodiments, at least some of the processing may be performed on distinct devices, resulting in a transmission of speaker-related information from one device to the device having the display.
At block 7.6201, the process performs wirelessly transmitting the speaker-related information. Various protocols may be used, including Bluetooth, infrared, WiFi, or the like.
At block 7.6301, the process performs transmitting the speaker-related information from a smart phone or portable media player to the second device. For example, a smart phone may forward the speaker-related information to a desktop computing system for display on an associated monitor.
At block 7.6401, the process performs transmitting the speaker-related information from a server system to the second device. In some embodiments, some portion of the processing is performed on a server system that may be remote from the hearing device.
At block 7.6501, the process performs transmitting the speaker-related information from a server system that resides in a data center.
At block 7.6601, the process performs transmitting the speaker-related information from a server system to a desktop computer of the user.
At block 7.6701, the process performs transmitting the speaker-related information from a server system to a mobile device of the user.
At block 7.6801, the process performs performing the receiving data representing a speech signal, the identifying the speaker, and/or the determining speaker-related information on a mobile device that is operated by the user. As noted, in some embodiments a mobile device such as a smart phone or media player may have sufficient processing power to perform a portion of the process, such as identifying the speaker, determining the speaker-related information, or the like.
At block 7.6901, the process performs identifying the speaker, performed on a smart phone or a media player that is operated by the user.
At block 7.7001, the process performs performing the receiving data representing a speech signal, the identifying the speaker, and/or the determining speaker-related information on a desktop computer that is operated by the user. For example, in an office setting, the user's desktop computer may be configured to perform some or all of the process.
At block 7.7101, the process performs determining to perform at least some of identifying the speaker or determining speaker-related information on another computing device that has available processing capacity. In some embodiments, the process may determine to offload some of its processing to another computing device or system.
At block 7.7201, the process performs receiving at least some of the speaker-related information from the other computing device. The process may receive the speaker-related information or a portion thereof from the other computing device.
At block 7.7301, the process performs determining whether or not the user can name the speaker.
At block 7.7302, the process performs when it is determined that the user cannot name the speaker, visually presenting the speaker-related information. In some embodiments, the process only informs the user of the speaker-related information upon determining that the user does not appear to be able to name the speaker.
At block 7.7401, the process performs determining whether the user has named the speaker. In some embodiments, the process listens to the user to determine whether the user has named the speaker.
At block 7.7501, the process performs determining whether the user has uttered a given name or surname of the speaker.
At block 7.7601, the process performs determining whether the user has uttered a nickname of the speaker.
At block 7.7701, the process performs determining whether the user has uttered a name of a relationship between the user and the speaker. In some embodiments, the user need not utter the name of the speaker, but instead may utter other information (e.g., a relationship) that may be used by the process to determine that the user knows or can name the speaker.
At block 7.7801, the process performs determining whether the user has uttered information that is related to both the speaker and the user.
At block 7.7901, the process performs determining whether the user has named a person, place, thing, or event that the speaker and the user have in common. For example, the user may mention a visit to the home town of the speaker, a vacation to a place familiar to the speaker, or the like.
At block 7.8001, the process performs performing speech recognition to convert an utterance of the user into text data.
At block 7.8002, the process performs determining whether or not the user can name the speaker based at least in part on the text data.
At block 7.8101, the process performs when the user does not name the speaker within a predetermined time interval, determining that the user cannot name the speaker. In some embodiments, the process waits for a time period before jumping in to provide the speaker-related information.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 5.100. In addition, the computing system 8.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 5.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 8.400 comprises a computer memory (“memory”) 8.401, a display 8.402, one or more Central Processing Units (“CPU”) 8.403, Input/Output devices 8.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 8.405, and network connections 8.406. The AEFS 5.100 is shown residing in memory 8.401. In other embodiments, some portion of the contents and/or some or all of the components of the AEFS 5.100 may be stored on and/or transmitted over the other computer-readable media 8.405. The components of the AEFS 5.100 preferably execute on one or more CPUs 8.403 and perform the ability enhancement functions described herein. Other code or programs 8.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 8.420, also reside in the memory 8.401, and preferably execute on one or more CPUs 8.403. Of note, one or more of the illustrated components may not be present in any specific implementation.
The AEFS 5.100 interacts via the network 8.450 with hearing devices 5.120, speaker-related information sources 5.130, and third-party systems/applications 8.455. The network 8.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 8.455 may include any systems that provide data to, or utilize data from, the AEFS 5.100, including Web browsers, e-commerce sites, calendar applications, email systems, social networking services, and the like.
The AEFS 5.100 is shown executing in the memory 8.401 of the computing system 8.400. Also included in the memory are a user interface manager 8.415 and an application program interface (“API”) 8.416. The user interface manager 8.415 and the API 8.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 5.100.
The UI manager 8.415 provides a view and a controller that facilitate user interaction with the AEFS 5.100 and its various components. For example, the UI manager 8.415 may provide interactive access to the AEFS 5.100, such that users can configure the operation of the AEFS 5.100, such as by providing the AEFS 5.100 credentials to access various sources of speaker-related information, including social networking services, email systems, document stores, or the like. In some embodiments, access to the functionality of the UI manager 8.415 may be provided via a Web server, possibly executing as one of the other programs 8.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 8.455 can interact with the AEFS 5.100 via the UI manager 8.415.
The API 8.416 provides programmatic access to one or more functions of the AEFS 5.100. For example, the API 8.416 may provide a programmatic interface to one or more functions of the AEFS 5.100 that may be invoked by one of the other programs 8.430 or some other module. In this manner, the API 8.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 5.100 into Web applications), and the like.
In addition, the API 8.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the hearing devices 5.120, information sources 5.130, and/or one of the third-party systems/applications 8.455, to access various functions of the AEFS 5.100. For example, an information source 5.130 may push speaker-related information (e.g., emails, documents, calendar events) to the AEFS 5.100 via the API 8.416. The API 8.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 8.455 and that are configured to interact with the AEFS 5.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 5.100 are implemented using standard programming techniques. For example, the AEFS 5.100 may be implemented as a “native” executable running on the CPU 8.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 5.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 8.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 5.100, such as in the data store 8.417, can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through markup languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 8.417 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 5.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, for language translation enhanced by using speaker-related information determined at least in part on speaker utterances. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). The AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities of a user, such as by performing automatic language translation from a first language used by a speaker to a second language that is familiar to a user. For example, when a user engages a speaker in conversation, the AEFS may “listen” to the speaker in order to determine speaker-related information, such as demographic information about the speaker (e.g., gender, language, country/region of origin), identifying information about the speaker (e.g., name, title), and/or events/communications relating to the speaker and/or the user. Then, the AEFS may use the determined information to augment, improve, enhance, adapt, or otherwise configure the operation of automatic language translation performed on foreign language utterances of the speaker. As the speaker generates utterances in the foreign language, the AEFS may translate the utterances into a representation (e.g., a message in textual format) in a second language that is familiar to the user. The AEFS can then present the representation in the second language to the user, allowing the user to engage in a more productive conversation with the speaker.
In some embodiments, the AEFS is configured to receive data that represents an utterance of a speaker in a first language and that is obtained at or about a hearing device associated with a user. The hearing device may be or include any device that is used by the user to hear sounds, including a hearing aid, a personal media device/player, a telephone, or the like. The AEFS may then determine speaker-related information associated with the speaker, based at least in part on the received data, such as by performing speaker recognition and/or speech recognition with the received data. The speaker-related information may be or include demographic information about the speaker (e.g., gender, country/region of origin, language(s) spoken by the speaker), identifying information about the speaker (e.g., name or title), and/or information items that reference the speaker (e.g., a document, event, communication).
Then, the AEFS may translate the utterance in the first language into a message in a second language, based at least in part on the speaker-related information. The message in the second language is at least an approximate translation of the utterance in the first language. Such a translation process may include some combination of speech recognition, natural language processing, machine translation, or the like. Upon performing the translation, the AEFS may present the message in the second language to the user. The message in the second language may be presented visually, such as via a visual display of a computing system/device that is accessible to the user. The message in the second language may also or instead be presented audibly, such as by “speaking” the message in the second language via speech synthesis through a hearing aid, audio speaker, or other audio output device accessible to the user. The presentation of the message in the second language may occur via the same or a different device than the hearing device that obtained the initial utterance.
In the scenario illustrated in the figure, a speaker 9.102 makes an utterance 9.110 in German, and the utterance 9.110 is captured by a hearing device 9.120 associated with a user 9.104.
The hearing device 9.120 receives a speech signal that represents the utterance 9.110, such as by receiving a digital representation of an audio signal received by a microphone of the hearing device 9.120. The hearing device 9.120 then transmits data representing the speech signal to the AEFS 9.100. Transmitting the data representing the speech signal may include transmitting audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal.
The AEFS 9.100 then determines speaker-related information associated with the speaker 9.102. Initially, the AEFS 9.100 may determine speaker-related information by automatically determining the language that is being used by the speaker 9.102. Determining the language may be based on signal processing techniques that identify signal characteristics unique to particular languages. Determining the language may also or instead be performed by simultaneous or concurrent application of multiple speech recognizers that are each configured to recognize speech in a corresponding language, and then choosing the language corresponding to the recognizer that produces the result having the highest confidence level. Determining the language may also or instead be based on contextual factors, such as GPS information indicating that the user 9.104 is in Germany, Austria, or some other region where German is commonly spoken.
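The multi-recognizer strategy described above might be sketched, for illustration only, as follows; the per-language recognizers are stand-ins for real speech-recognition engines, and the confidences shown are invented.

```python
# Hypothetical sketch of language identification by running one recognizer
# per candidate language and keeping the highest-confidence result.
def identify_language(audio, recognizers):
    """recognizers: maps language code -> callable(audio) -> (text, confidence)."""
    results = {lang: rec(audio) for lang, rec in recognizers.items()}
    best_lang, (text, confidence) = max(results.items(), key=lambda kv: kv[1][1])
    return best_lang, text, confidence

fake_recognizers = {
    "de": lambda audio: ("meine katze ist krank", 0.91),
    "en": lambda audio: ("mine cats a crank", 0.34),
}
print(identify_language(b"...audio...", fake_recognizers))  # ('de', ..., 0.91)
```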
In some embodiments, determining speaker-related information may include identifying the speaker 9.102 based on the received data representing the speech signal. Identifying the speaker 9.102 may include performing speaker recognition, such as by generating a “voice print” from the received data and comparing the generated voice print to previously obtained voice prints. For example, the generated voice print may be compared to multiple voice prints that are stored as audio data 9.130c and that each correspond to a speaker, in order to determine a speaker who has a voice that most closely matches the voice of the speaker 9.102. The voice prints stored as audio data 9.130c may be generated based on various sources of data, including data corresponding to speakers previously identified by the AEFS 9.100, voice mail messages, speaker enrollment data, or the like.
In some embodiments, identifying the speaker 9.102 may include performing speech recognition, such as by automatically converting the received data representing the speech signal into text. The text of the speaker's utterance 9.110 may then be used to identify the speaker. In particular, the text may identify one or more entities such as information items (e.g., communications, documents), events (e.g., meetings, deadlines), persons, or the like, that may be used by the AEFS 9.100 to identify the speaker. The information items may be accessed with reference to the messages 9.130a and/or documents 9.130b. As one example, the speaker's utterance 9.110 may identify an email message that was sent to the speaker 9.102 and the user 9.104 (e.g., “That sure was a nasty email Bob sent us”). As another example, the speaker's utterance 9.110 may identify a meeting or other event to which both the speaker 9.102 and the user 9.104 are invited.
Note that in some cases, the speaker's utterance 9.110 may not definitively identify the speaker 9.102, such as because the user 9.104 may only have just met the speaker 9.102 (e.g., if the user is traveling). In other cases, a definitive identification may not be obtained because a communication being used to identify the speaker was sent to recipients in addition to the speaker 9.102 and the user 9.104, leaving some ambiguity as to the actual identity of the speaker. However, in such cases, a preliminary identification of multiple candidate speakers may still be used by the AEFS 9.100 to narrow the set of potential speakers, and may be combined with (or used to improve) other techniques for speaker identification, including speaker recognition as discussed above. In addition, even if the speaker 9.102 is unknown to the user 9.104, the AEFS 9.100 may still determine useful demographic or other speaker-related information that may be fruitfully employed for speech recognition purposes.
Note also that speaker-related information need not definitively identify the speaker. In particular, it may also or instead be or include other information about or related to the speaker, such as demographic information including the gender of the speaker 9.102, his country or region of origin, the language(s) spoken by the speaker 9.102, or the like. Speaker-related information may include an organization that includes the speaker (along with possibly other persons, such as a company or firm), an information item that references the speaker (and possibly other persons), an event involving the speaker, or the like. The speaker-related information may generally be determined with reference to the messages 9.130a, documents 9.130b, and/or audio data 9.130c. For example, having determined the identity of the speaker 9.102, the AEFS 9.100 may search for emails and/or documents that are stored as messages 9.130a and/or documents 9.130b and that reference (e.g., are sent to, are authored by, are named in) the speaker 9.102.
Other types of speaker-related information are contemplated, including social networking information, such as personal or professional relationship graphs represented by a social networking service, messages or status updates sent within a social network, or the like. Social networking information may also be derived from other sources, including email lists, contact lists, communication patterns (e.g., frequent recipients of emails), or the like.
Having determined speaker-related information, the AEFS 9.100 then translates the utterance 9.110 in German into an utterance in a second language. In this example, the second language is the preferred language of the user 9.104, English. In some embodiments, the AEFS 9.100 translates the utterance 9.110 by first performing speech recognition to convert the utterance 9.110 into a textual representation that includes a sequence of German words. Then, the AEFS 9.100 may translate the German text into a message including English text, using machine translation techniques. Speech recognition and/or machine translation may be modified, enhanced, and/or otherwise adapted based on the speaker-related information. For example, a speech recognizer may use speech or language models tailored to the speaker's gender, accent/dialect (e.g., determined based on country/region of origin), social class, or the like. As another example, a lexicon that is specific to the speaker 9.102 may be used during speech recognition and/or language translation. Such a lexicon may be determined based on prior communications of the speaker 9.102, profession of the speaker (e.g., engineer, attorney, doctor), or the like.
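A minimal sketch of this recognize-then-translate pipeline follows; the recognizer and the dictionary-based translator are placeholders, and the SpeakerInfo structure is an illustrative assumption showing how a speaker-specific lexicon could be threaded through both stages.

```python
# Sketch: speech recognition followed by machine translation, with
# speaker-related information (here, a speaker-specific lexicon) available
# to both stages.
from dataclasses import dataclass, field

@dataclass
class SpeakerInfo:
    language: str = "de"
    lexicon: dict = field(default_factory=dict)   # speaker-specific terms

def recognize_speech(audio: bytes, info: SpeakerInfo) -> str:
    # Placeholder: a real recognizer would be selected/biased using `info`.
    return "meine Katze ist krank"

def machine_translate(text: str, src: str, dst: str, info: SpeakerInfo) -> str:
    # Placeholder dictionary translation; info.lexicon overrides generic entries.
    table = {"meine": "my", "Katze": "cat", "ist": "is", "krank": "sick"}
    table.update(info.lexicon)
    return " ".join(table.get(w, w) for w in text.split())

def translate_utterance(audio: bytes, info: SpeakerInfo, dst: str = "en") -> str:
    text = recognize_speech(audio, info)
    return machine_translate(text, info.language, dst, info)

print(translate_utterance(b"...", SpeakerInfo(lexicon={"Katze": "cat"})))
```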
Once the AEFS 9.100 has translated the initial utterance 9.110 into a message in English, the AEFS 9.100 can present the English message to the user 9.104. Various techniques are contemplated. In one approach, the AEFS 9.100 causes the hearing device 9.120 (or some other device accessible to the user) to visually display the message as message 9.112 on the display 9.121. In the illustrated example, the AEFS 9.100 causes a message 9.112 that includes the text “My cat is sick” (which is the English translation of “Meine Katze ist krank”) to be displayed on the display 9.121. Upon reading the message 9.112 and thereby learning about the condition of the speaker's cat, the user 9.104 responds to the speaker's original utterance 9.110 with a response utterance 9.114 by speaking the words “I can help.” The speaker 9.102 may either understand English or himself have access to the AEFS 9.100 so that the speaker 9.102 and the user 9.104 can have a productive conversation. As the speaker 9.102 and the user 9.104 continue to converse, the AEFS 9.100 may monitor the conversation and continue to provide translations to the user 9.104 (and possibly the speaker 9.102).
In another approach, the AEFS 9.100 causes the hearing device 9.120 (or some other device) to “speak” or “tell” the user 9.104 the message in English. Presenting a message in this manner may include converting a textual representation of the message into audio via text-to-speech processing (e.g., speech synthesis), and then presenting the audio via an audio speaker (e.g., earphone, earpiece, earbud) of the hearing device 9.120. In the illustrated scenario, the AEFS 9.100 causes the hearing device 9.120 to make an utterance 9.113 by playing audio of the words “My cat is sick” via a speaker (not shown) of the hearing device 9.120.
As an initial matter, note that the AEFS 9.100 may use output devices of a hearing device or other devices to present translations as well as other information, such as speaker-related information that may generally assist the user 9.104 in interacting with the speaker 9.102. For example, in addition to providing translations, the AEFS 9.100 may present speaker-related information about the speaker 9.102, such as his name, title, communications that reference or are related to the speaker, and the like.
For audio output, each of the illustrated hearing devices 9.120 may include or be communicatively coupled to an audio speaker operable to generate and output audio signals that may be perceived by the user 9.104. As discussed above, the AEFS 9.100 may use such a speaker to provide translations to the user 9.104. The AEFS 9.100 may also or instead audibly notify, via a speaker of a hearing device 9.120, the user 9.104 to view a translation or other information displayed on the hearing device 9.120. For example, the AEFS 9.100 may cause a tone (e.g., beep, chime) to be played via the earphones 9.124 of the personal media player hearing device 9.120c. Such a tone may then be recognized by the user 9.104, who will in response attend to information displayed on the display 9.123. Such audible notification may be used to identify a display that is being used as a current display, such as when multiple displays are being used. For example, different first and second tones may be used to direct the user's attention to a desktop display and a smart phone display, respectively. In some embodiments, audible notification may include playing synthesized speech (e.g., from text-to-speech processing) telling the user 9.104 to view speaker-related information on a particular display device (e.g., “Recent email on your smart phone”).
The AEFS 9.100 may generally cause translations and/or speaker-related information to be presented on various destination output devices. In some embodiments, the AEFS 9.100 may use a display of a hearing device as a target for displaying a translation or other information. For example, the AEFS 9.100 may display a translation or speaker-related information on the display 9.121 of the smart phone 9.120a. On the other hand, when the hearing device does not have its own display, such as hearing aid device 9.120b, the AEFS 9.100 may display speaker-related information on some other destination display that is accessible to the user 9.104. For example, when the hearing aid device 9.120b is the hearing device and the user also has the personal media player 9.120c in his possession, the AEFS 9.100 may elect to display speaker-related information upon the display 9.123 of the personal media player 9.120c.
The AEFS 9.100 may determine a destination output device for a translation, speaker-related information, or other information. In some embodiments, determining a destination output device may include selecting from one of multiple possible destination displays based on whether a display is capable of displaying all of the information. For example, if the environment is noisy, the AEFS may elect to visually display a translation rather than play it through a speaker. As another example, if the user 9.104 is proximate to a first display that is capable of displaying only text and a second display capable of displaying graphics, the AEFS 9.100 may select the second display when the presented information includes graphics content (e.g., an image). In some embodiments, determining a destination display may include selecting from one of multiple possible destination displays based on the size of each display. For example, a small LCD display (such as may be found on a mobile phone) may be suitable for displaying a message that is just a few characters (e.g., a name or greeting) but may not be suitable for displaying a longer message or a large document. Note that the AEFS 9.100 may select between multiple potential target output devices even when the hearing device itself includes its own display and/or speaker.
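The following sketch illustrates one possible destination-selection policy; the capability flags, character limits, and the quiet-versus-noisy rule are illustrative assumptions rather than requirements of any embodiment.

```python
# Sketch: choose an output device for a message based on simple device
# capability flags, message size, graphics content, and ambient noise.
from dataclasses import dataclass

@dataclass
class OutputDevice:
    name: str
    has_display: bool = False
    has_speaker: bool = False
    supports_graphics: bool = False
    max_chars: int = 0

def choose_destination(devices, message: str, has_graphics: bool,
                       environment_noisy: bool):
    """Prefer audio in quiet environments; otherwise use the smallest adequate display."""
    if not environment_noisy and not has_graphics:
        for d in devices:
            if d.has_speaker:
                return d, "audio"
    candidates = [d for d in devices
                  if d.has_display
                  and len(message) <= d.max_chars
                  and (d.supports_graphics or not has_graphics)]
    if candidates:
        return min(candidates, key=lambda d: d.max_chars), "visual"
    return None, None

devices = [OutputDevice("hearing aid", has_speaker=True),
           OutputDevice("smart phone", has_display=True, has_speaker=True,
                        supports_graphics=True, max_chars=500),
           OutputDevice("desktop", has_display=True, supports_graphics=True,
                        max_chars=10000)]
print(choose_destination(devices, "My cat is sick", has_graphics=False,
                         environment_noisy=True))
```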
Determining a destination output device may be based on other or additional factors. In some embodiments, the AEFS 9.100 may use user preferences that have been inferred (e.g., based on current or prior interactions with the user 9.104) and/or explicitly provided by the user. For example, the AEFS 9.100 may determine to present a translation, an email, or other speaker-related information onto the display 9.121 of the smart phone 9.120a based on the fact that the user 9.104 is currently interacting with the smart phone 9.120a.
Note that although the AEFS 9.100 is shown as being separate from a hearing device 9.120, some or all of the functions of the AEFS 9.100 may be performed within or by the hearing device 9.120 itself. For example, the smart phone hearing device 9.120a and/or the media player hearing device 9.120c may have sufficient processing power to perform all or some functions of the AEFS 9.100, including one or more of speaker identification, determining speaker-related information, speaker recognition, speech recognition, language translation, presenting information, or the like. In some embodiments, the hearing device 9.120 includes logic to determine where to perform various processing tasks, so as to advantageously distribute processing between available resources, including that of the hearing device 9.120, other nearby devices (e.g., a laptop or other computing device of the user 9.104 and/or the speaker 9.102), remote devices (e.g., “cloud-based” processing and/or storage), and the like.
Other types of hearing devices are contemplated. For example, a land-line telephone may be configured to operate as a hearing device, so that the AEFS 9.100 can translate utterances from speakers who are engaged in a conference call. As another example, a hearing device may be or be part of a desktop computer, laptop computer, PDA, tablet computer, or the like.
The speech and language engine 10.210 includes a speech recognizer 10.212, a speaker recognizer 10.214, a natural language processor 10.216, and a language translation processor 10.218. The speech recognizer 10.212 transforms speech audio data received from the hearing device 9.120 into a textual representation of an utterance represented by the speech audio data. In some embodiments, the performance of the speech recognizer 10.212 may be improved or augmented by use of a language model (e.g., representing likelihoods of transitions between words, such as based on n-grams) or speech model (e.g., representing acoustic properties of a speaker's voice) that is tailored to or based on an identified speaker. For example, once a speaker has been identified, the speech recognizer 10.212 may use a language model that was previously generated based on a corpus of communications and other information items authored by the identified speaker. A speaker-specific language model may be generated based on a corpus of documents and/or messages authored by a speaker. Speaker-specific speech models may be used to account for accents or channel properties (e.g., due to environmental factors or communication equipment) that are specific to a particular speaker, and may be generated based on a corpus of recorded speech from the speaker. In some embodiments, multiple speech recognizers are present, each one configured to recognize speech in a different language.
The speaker recognizer 10.214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the hearing device 9.120. The speaker recognizer 10.214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 10.240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data.
The natural language processor 10.216 processes text generated by the speech recognizer 10.212 and/or located in information items obtained from the speaker-related information sources 9.130. In doing so, the natural language processor 10.216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification, language translation, and/or other functions of the AEFS 9.100. For example, the natural language processor 10.216 may process status updates posted by the user 9.104 on a social networking service, to determine that the user 9.104 recently attended a conference in a particular city, and this fact may be used to identify a speaker and/or determine other speaker-related information, which may in turn be used for language translation or other functions.
The language translation processor 10.218 translates from one language to another, for example, by converting text in a first language to text in a second language. The text input to the language translation processor 10.218 may be obtained from, for example, the speech recognizer 10.212 and/or the natural language processor 10.216. The language translation processor 10.218 may use speaker-related information to improve or adapt its performance. For example, the language translation processor 10.218 may use a lexicon or vocabulary that is tailored to the speaker, such as may be based on the speaker's country/region of origin, the speaker's social class, the speaker's profession, or the like.
The agent logic 10.220 implements the core intelligence of the AEFS 9.100. The agent logic 10.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to identify speakers, determine speaker-related information, and/or perform translations. For example, the agent logic 10.220 may combine spoken text from the speech recognizer 10.212, a set of potentially matching (candidate) speakers from the speaker recognizer 10.214, and information items from the information sources 9.130, in order to determine a most likely identity of the current speaker. As another example, the agent logic 10.220 may identify the language spoken by the speaker by analyzing the output of multiple speech recognizers that are each configured to recognize speech in a different language, to identify the language of the speech recognizer that returns the highest confidence result as the spoken language.
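A simplified sketch of such evidence combination appears below; the additive fusion and the 0.1 weight are illustrative assumptions standing in for a rules engine, decision tree, or Bayesian inference engine.

```python
# Sketch: combine voice-based candidate likelihoods with corroborating
# evidence from information items (emails, documents, calendar events).
def combine_evidence(voice_candidates, mentioned_entities, information_index):
    """voice_candidates: {speaker_id: likelihood from the speaker recognizer}
    mentioned_entities: entities extracted from the recognized utterance
    information_index: {speaker_id: set of entities appearing in that
                        speaker's emails, documents, and calendar events}"""
    scores = {}
    for speaker, likelihood in voice_candidates.items():
        overlap = len(mentioned_entities & information_index.get(speaker, set()))
        scores[speaker] = likelihood + 0.1 * overlap   # simple additive fusion
    best = max(scores, key=scores.get)
    return best, scores

voice_candidates = {"joe": 0.55, "bob": 0.50}
mentioned = {"nasty email", "project deadline"}
index = {"bob": {"nasty email"}, "joe": set()}
print(combine_evidence(voice_candidates, mentioned, index))   # favors "bob"
```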
The presentation engine 10.230 includes a visible output processor 10.232 and an audible output processor 10.234. The visible output processor 10.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the hearing device 9.120 or some other display (e.g., a desktop or laptop display in proximity to the user 9.104). The agent logic 10.220 may use or invoke the visible output processor 10.232 to prepare and display information, such as by formatting or otherwise modifying a translation or some speaker-related information to fit on a particular type or size of display. The audible output processor 10.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 10.220 may use or invoke the audible output processor 10.234 in order to convert a textual message (e.g., a translation or speaker-related information) into audio output suitable for presentation via the hearing device 9.120, for example by employing a text-to-speech processor.
Note that although speaker identification and/or determining speaker-related information is herein sometimes described as including the positive identification of a single speaker, it may instead or also include determining likelihoods that each of one or more persons is the current speaker. For example, the speaker recognizer 10.214 may provide to the agent logic 10.220 indications of multiple candidate speakers, each having a corresponding likelihood or confidence level. The agent logic 10.220 may then select the most likely candidate based on the likelihoods alone or in combination with other information, such as that provided by the speech recognizer 10.212, natural language processor 10.216, speaker-related information sources 9.130, or the like. In some cases, such as when there are a small number of reasonably likely candidate speakers, the agent logic 10.220 may inform the user 9.104 of the identities of all of the candidate speakers (as opposed to a single candidate speaker), as such information may be sufficient to trigger the user's recall and enable the user to make a selection that informs the agent logic 10.220 of the speaker's identity.
FIGS. 11.1-11.80 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 11.101, the process performs receiving data representing a speech signal obtained at a hearing device associated with a user, the speech signal representing an utterance of a speaker in a first language. The received data may be or represent the speech signal itself (e.g., audio samples) and/or higher-order information (e.g., frequency coefficients). The data may be received by or at the hearing device 9.120 and/or the AEFS 9.100.
At block 11.102, the process performs determining speaker-related information associated with the speaker, based on the data representing the speech signal. The speaker-related information may include demographic information about the speaker, including gender, language spoken, country of origin, region of origin, or the like. The speaker-related information may also or instead include identifiers of the speaker (e.g., names, titles) and/or related information, such as documents, emails, calendar events, or the like. The speaker-related information may be determined based on signal properties of the speech signal (e.g., a voice print) and/or on the content of the utterance, such as a name, event, entity, or information item that was mentioned by the speaker.
At block 11.103, the process performs translating the utterance in the first language into a message in a second language, based on the speaker-related information. The utterance may be translated by first performing speech recognition on the data representing the speech signal to convert the utterance into textual form. Then, the text of the utterance may be translated into the second language using natural language processing and/or machine translation techniques. The speaker-related information may be used to improve, enhance, or otherwise modify the process of machine translation. For example, based on the identity of the speaker, the process may use a language or speech model that is tailored to the speaker in order to improve a machine translation process. As another example, the process may use one or more information items that reference the speaker to improve machine translation, such as by disambiguating references in the utterance of the speaker.
At block 11.104, the process performs presenting the message in the second language. The message may be presented in various ways including using audible output (e.g., via text-to-speech processing of the message) and/or using visible output of the message (e.g., via a display screen of the hearing device or some other device that is accessible to the user).
At block 11.201, the process performs determining the first language. In some embodiments, the process may determine or identify the first language, possibly prior to performing language translation. For example, the process may determine that the speaker is speaking in German, so that it can configure a speech recognizer to recognize German language utterances.
At block 11.301, the process performs concurrently processing the received data with multiple speech recognizers that are each configured to recognize speech in a different corresponding language. For example, the process may utilize speech recognizers for German, French, English, Chinese, Spanish, and the like, to attempt to recognize the speaker's utterance.
At block 11.302, the process performs selecting as the first language the language corresponding to a speech recognizer of the multiple speech recognizers that produces a result that has a higher confidence level than the others of the multiple speech recognizers. Typically, a speech recognizer may provide a confidence level corresponding with each recognition result. The process can exploit this confidence level to determine the most likely language being spoken by the speaker, such as by taking the result with the highest confidence level, if one exists.
At block 11.401, the process performs identifying signal characteristics in the received data that are correlated with the first language. In some embodiments, the process may exploit signal properties or characteristics that are highly correlated with particular languages. For example, spoken German may include phonemes that are unique to or at least more common in German than in other languages.
At block 11.501, the process performs receiving an indication of a current location of the user. The current location may be based on a GPS coordinate provided by the hearing device 9.120 or some other device. The current location may be determined based on other context information, such as a network identifier, travel documents, or the like.
At block 11.502, the process performs determining one or more languages that are commonly spoken at the current location. The process may reference a knowledge base or other information that associates locations with common languages.
At block 11.503, the process performs selecting one of the one or more languages as the first language.
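The lookup-based approach of blocks 11.501 through 11.503 might be sketched as follows; the country-to-language table and the tie-breaking policy are illustrative assumptions standing in for the knowledge base mentioned above.

```python
# Sketch: use the user's current location to narrow the candidate languages,
# then break ties with recognizer confidence levels.
COMMON_LANGUAGES = {
    "DE": ["de"], "AT": ["de"], "CH": ["de", "fr", "it"],
    "US": ["en", "es"], "CA": ["en", "fr"],
}

def languages_for_location(country_code: str) -> list:
    return COMMON_LANGUAGES.get(country_code.upper(), [])

def pick_first_language(country_code: str, recognizer_confidences: dict) -> str:
    """Among languages common at the location, prefer the one the recognizers
    are most confident about; fall back to the overall best recognizer."""
    local = languages_for_location(country_code)
    ranked = sorted(recognizer_confidences, key=recognizer_confidences.get,
                    reverse=True)
    for lang in ranked:
        if lang in local:
            return lang
    return ranked[0] if ranked else (local[0] if local else "en")

print(pick_first_language("AT", {"de": 0.8, "fr": 0.3}))   # -> "de"
```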
At block 11.601, the process performs presenting indications of multiple languages to the user. In some embodiments, the process may ask the user to choose the language of the speaker. For example, the process may not be able to determine the language itself, or the process may have determined multiple equally likely candidate languages. In such circumstances, the process may prompt or otherwise request that the user indicate the language of the speaker.
At block 11.602, the process performs receiving from the user an indication of one of the multiple languages. The user may identify the language in various ways, such as via a spoken command, a gesture, a user interface input, or the like.
At block 11.701, the process performs selecting a speech recognizer configured to recognize speech in the first language. Once the process has determined the language of the speaker, it may select or configure a speech recognizer or other component (e.g., machine translation engine) to process the first language.
At block 11.801, the process performs performing speech recognition, based on the speaker-related information, on the data representing the speech signal to convert the utterance in the first language into text representing the utterance in the first language. The speech recognition process may be improved, augmented, or otherwise adapted based on the speaker-related information. In one example, information about vocabulary frequently used by the speaker may be used to improve the performance of a speech recognizer.
At block 11.802, the process performs translating, based on the speaker-related information, the text representing the utterance in the first language into text representing the message in the second language. Translating from a first to a second language may also be improved, augmented, or otherwise adapted based on the speaker-related information. For example, when such a translation includes natural language processing to determine syntactic or semantic information about an utterance, such natural language processing may be improved with information about the speaker, such as idioms, expressions, or other language constructs frequently employed or otherwise correlated with the speaker.
At block 11.901, the process performs performing speech synthesis to convert the text representing the utterance in the second language into audio data representing the message in the second language.
At block 11.902, the process performs causing the audio data representing the message in the second language to be played to the user. The message may be played, for example, via an audio speaker of the hearing device 9.120.
At block 11.1001, the process performs performing speech recognition based on cepstral coefficients that represent the speech signal. In other embodiments, other types of features or information may also or instead be used to perform speech recognition, including language models, dialect models, or the like.
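A minimal numpy sketch of computing cepstral coefficients per frame is shown below; real front ends more commonly use mel-frequency cepstral coefficients (MFCCs) with mel filter banks and a discrete cosine transform, so this is a simplified illustration rather than a prescribed feature pipeline.

```python
# Sketch: real cepstrum per frame (inverse FFT of the log magnitude spectrum),
# keeping the first few coefficients as features for speech recognition.
import numpy as np

def cepstral_coefficients(samples: np.ndarray, frame: int = 400,
                          hop: int = 160, n_coeffs: int = 13) -> np.ndarray:
    window = np.hamming(frame)
    coeffs = []
    for start in range(0, len(samples) - frame, hop):
        x = samples[start:start + frame] * window
        spectrum = np.abs(np.fft.rfft(x)) + 1e-10       # avoid log(0)
        cepstrum = np.fft.irfft(np.log(spectrum))
        coeffs.append(cepstrum[:n_coeffs])
    return np.array(coeffs)        # shape: (num_frames, n_coeffs)

rng = np.random.default_rng(0)
features = cepstral_coefficients(rng.normal(size=16000))   # ~1 s at 16 kHz
print(features.shape)
```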
At block 11.1101, the process performs performing hidden Markov model-based speech recognition. Other approaches or techniques for speech recognition may include neural networks, stochastic modeling, or the like.
At block 11.1201, the process performs translating the utterance based on speaker-related information including an identity of the speaker. The identity of the speaker may be used in various ways, such as to determine a speaker-specific vocabulary to use during speech recognition, natural language processing, machine translation, or the like.
At block 11.1301, the process performs translating the utterance based on speaker-related information including a language model that is specific to the speaker. A speaker-specific language model may include or otherwise identify frequent words or patterns of words (e.g., n-grams) based on prior communications or other information about the speaker. Such a language model may be based on communications or other information generated by or about the speaker. Such a language model may be employed in the course of speech recognition, natural language processing, machine translation, or the like. Note that the language model need not be unique to the speaker, but may instead be specific to a class, type, or group of speakers that includes the speaker. For example, the language model may be tailored for speakers in a particular industry, from a particular region, or the like.
At block 11.1401, the process performs translating the utterance based on a language model that is tailored to a group of people of which the speaker is a member. As noted, the language model need not be unique to the speaker. In some embodiments, the language model may be tuned to particular social classes, ethnic groups, countries, languages, or the like with which the speaker may be associated.
At block 11.1501, the process performs generating the language model based on communications generated by the speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like to generate a language model that is specific or otherwise tailored to the speaker.
At block 11.1601, the process performs generating the language model based on emails transmitted by the speaker. In some embodiments, a corpus of emails may be processed to determine n-grams that represent likelihoods of various word transitions.
At block 11.1701, the process performs generating the language model based on documents authored by the speaker. In some embodiments, a corpus of documents may be processed to determine n-grams that represent likelihoods of various word transitions.
At block 11.1801, the process performs generating the language model based on social network messages transmitted by the speaker.
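The following sketch shows one way a speaker-specific bigram language model could be estimated from such a corpus of emails, documents, or social network messages; the maximum-likelihood estimate without smoothing is an illustrative simplification.

```python
# Sketch: build a bigram model P(next_word | word) from a text corpus
# attributed to the speaker.
from collections import Counter, defaultdict

def build_bigram_model(corpus):
    """Return P(next_word | word) estimated from bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    model = defaultdict(dict)
    for (w1, w2), count in bigrams.items():
        model[w1][w2] = count / unigrams[w1]
    return model

emails = ["the cat is sick", "the cat sleeps all day", "my cat is old"]
model = build_bigram_model(emails)
print(model["cat"])   # e.g., {'is': 0.666..., 'sleeps': 0.333...}
```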
At block 11.1901, the process performs translating the utterance based on speaker-related information including a speech model that is tailored to the speaker. A speech model tailored to the speaker (e.g., representing properties of the speech signal of the user) may be used to adapt or improve the performance of a speech recognizer. Note that the speech model need not be unique to the speaker, but may instead be specific to a class, type, or group of speakers that includes the speaker. For example, the speech model may be tailored for male speakers, female speakers, speakers from a particular country or region (e.g., to account for accents), or the like.
At block 11.2001, the process performs translating the utterance based on a speech model that is tailored to a group of people of which the speaker is a member. As noted, the speech model need not be unique to the speaker. In some embodiments, the speech model may be tuned to particular genders, social classes, ethnic groups, countries, languages, or the like with which the speaker may be associated.
At block 11.2101, the process performs translating the utterance based on speaker-related information including an information item that references the speaker. The information item may include a document, a message, a calendar event, a social networking relation, or the like. Various forms of information items are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, an information item may include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 11.2201, the process performs translating the utterance based on speaker-related information including a document that references the speaker. The document may be, for example, a report authored by the speaker.
At block 11.2301, the process performs translating the utterance based on speaker-related information including a message that references the speaker. The message may be an email, text message, social network status update or other communication that is sent by the speaker, sent to the speaker, or references the speaker in some other way.
At block 11.2401, the process performs translating the utterance based on speaker-related information including a calendar event that references the speaker. The calendar event may represent a past or future event to which the speaker was invited. An event may be any occurrence that involves or involved the user and/or the speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the speaker, an upcoming deadline (e.g., for a project), or the like.
At block 11.2501, the process performs translating the utterance based on speaker-related information including an indication of gender of the speaker. Information about the gender of the speaker may be used to customize or otherwise adapt a speech or language model that may be used during machine translation.
At block 11.2601, the process performs translating the utterance based on speaker-related information including an organization to which the speaker belongs. The process may exploit an understanding of an organization to which the speaker belongs when performing natural language processing on the utterance. For example, the identity of a company that employs the speaker can be used to determine the meaning of industry-specific vocabulary in the utterance of the speaker. The organization may include a business, company (e.g., profit or non-profit), group, school, club, team, or other formal or informal organization with which the speaker is affiliated.
At block 11.2701, the process performs performing speech recognition to convert the received data into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by the speaker.
At block 11.2702, the process performs determining the speaker-related information based on the text data. Given text data (e.g., words spoken by the speaker), the process may search for information items that include the text data, and then identify the speaker or determine other speaker-related information based on those information items, as discussed further below.
At block 11.2801, the process performs finding a document that references the speaker and that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item that includes words spoken by the speaker. Then, the process can infer that the speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
At block 11.2901, the process performs retrieving information items that reference the text data. The process may here retrieve or otherwise obtain documents, calendar events, messages, or the like, that include, contain, or otherwise reference some portion of the text data.
At block 11.3001, the process performs accessing information items associated with the speaker. In some embodiments, accessing information items associated with the speaker may include retrieving files, documents, data records, or the like from various sources, such as local or remote storage devices, including cloud-based servers, and the like. In some embodiments, accessing information items may also or instead include scanning, searching, indexing, or otherwise processing information items to find ones that include, name, mention, or otherwise reference the speaker.
At block 11.3101, the process performs searching for information items that reference the speaker. In some embodiments, searching may include formulating a search query to provide to a document management system or any other data/document store that provides a search interface.
At block 11.3201, the process performs searching stored emails to find emails that reference the speaker. In some embodiments, emails that reference the speaker may include emails sent from the speaker, emails sent to the speaker, emails that name or otherwise identify the speaker in the body of an email, or the like.
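A minimal sketch of such an email search follows; the Email structure and the substring matching are illustrative assumptions rather than a particular mail system's API.

```python
# Sketch: find stored emails that reference a speaker as sender, recipient,
# or by name in the message body.
from dataclasses import dataclass

@dataclass
class Email:
    sender: str
    recipients: list
    body: str

def emails_referencing(speaker: str, mailbox: list) -> list:
    needle = speaker.lower()
    return [m for m in mailbox
            if needle in m.sender.lower()
            or any(needle in r.lower() for r in m.recipients)
            or needle in m.body.lower()]

mailbox = [Email("bob@example.com", ["user@example.com"], "See you at the review."),
           Email("ann@example.com", ["user@example.com"], "Bob sent us a nasty email.")]
print(len(emails_referencing("bob", mailbox)))   # -> 2
```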
At block 11.3301, the process performs searching stored text messages to find text messages that reference the speaker. In some embodiments, text messages that reference the speaker include messages sent to/from the speaker, messages that name or otherwise identify the speaker in a message body, or the like.
At block 11.3401, the process performs accessing a social networking service to find messages or status updates that reference the speaker. In some embodiments, accessing a social networking service may include searching for postings, status updates, personal messages, or the like that have been posted by, posted to, or otherwise reference the speaker. Example social networking services include Facebook, Twitter, Google Plus, and the like. Access to a social networking service may be obtained via an API or similar interface that provides access to social networking data related to the user and/or the speaker.
At block 11.3501, the process performs accessing a calendar to find information about appointments with the speaker. In some embodiments, accessing a calendar may include searching a private or shared calendar to locate a meeting or other appointment with the speaker, and providing such information to the user via the hearing device.
At block 11.3601, the process performs accessing a document store to find documents that reference the speaker. In some embodiments, documents that reference the speaker include those that are authored at least in part by the speaker, those that name or otherwise identify the speaker in a document body, or the like. Accessing the document store may include accessing a local or remote storage device/system, accessing a document management system, accessing a source control system, or the like.
At block 11.3701, the process performs performing voice identification based on the received data to identify the speaker. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints.
At block 11.3801, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple distinct speakers. In some embodiments, the process accesses voice prints associated with multiple speakers, and determines a best match against the speech signal.
At block 11.3901, the process performs processing voice messages from the multiple distinct speakers to generate voice print data for each of the multiple distinct speakers. Given a telephone voice message, the process may associate generated voice print data for the voice message with one or more (direct or indirect) identifiers corresponding with the message. For example, the message may have a sender telephone number associated with it, and the process can use that sender telephone number to do a reverse directory lookup (e.g., in a public directory, in a personal contact list) to determine the name of the voice message speaker.
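The enrollment step described above might be sketched as follows; the reverse-lookup directory and the identity feature extractor are illustrative placeholders for a contact list or public directory and a real voice-print generator.

```python
# Sketch: enroll voice prints from stored voice messages, labeling each print
# by reverse-looking-up the sender's phone number.
def enroll_from_voice_messages(messages, directory, extract_print):
    """messages: iterable of (sender_number, audio_samples) pairs
    directory: {phone_number: person_name} reverse-lookup table
    extract_print: function mapping audio samples to a voice-print vector"""
    enrolled = {}
    for number, audio in messages:
        name = directory.get(number)
        if name is None:
            continue                      # unknown sender; skip or queue for review
        enrolled.setdefault(name, []).append(extract_print(audio))
    return enrolled

directory = {"+15551234": "Joe Smith", "+15559876": "Ann Jones"}
messages = [("+15551234", [0.1, 0.2, 0.3]), ("+15550000", [0.4, 0.5])]
prints = enroll_from_voice_messages(messages, directory, extract_print=lambda a: a)
print(list(prints))    # -> ['Joe Smith']
```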
At block 11.4001, the process performs processing telephone voice messages stored by a voice mail service. In some embodiments, the process analyzes voice messages to generate voice prints/models for multiple speakers.
At block 11.4101, the process performs determining that the speaker cannot be identified. In some embodiments, the process may determine that the speaker cannot be identified, for example because the speaker has not been previously identified, enrolled, or otherwise encountered. In some cases, the process may be unable to identify the speaker due to signal quality, environmental conditions, or the like.
At block 11.4201, the process performs when it is determined that the speaker cannot be identified, storing the received data for system training. In some embodiments, the received data may be stored when the speaker cannot be identified, so that the system can be trained or otherwise configured to identify the speaker at a later time.
At block 11.4301, the process performs when it is determined that the speaker cannot be identified, notifying the user. In some embodiments, the user may be notified that the process cannot identify the speaker, such as by playing a tone, voice feedback, or displaying a message. The user may in response manually identify the speaker or otherwise provide speaker-related information (e.g., the language spoken by the speaker) so that the process can perform translation or other functions.
At block 11.4401, the process performs receiving data representing a speech signal that represents an utterance of the user. A microphone on or about the hearing device may capture this data. The microphone may be the same or different from one used to capture speech data from the speaker.
At block 11.4402, the process performs determining the speaker-related information based on the data representing a speech signal that represents an utterance of the user. Identifying the speaker in this manner may include performing speech recognition on the user's utterance, and then processing the resulting text data to locate a name. This identification can then be utilized to retrieve information items or other speaker-related information that may be useful to present to the user.
At block 11.4501, the process performs determining whether the utterance of the user includes a name of the speaker.
At block 11.4601, the process performs receiving context information related to the user. Context information may generally include information about the setting, location, occupation, communication, workflow, or other event or factor that is present at, about, or with respect to the user.
At block 11.4602, the process performs determining speaker-related information, based on the context information. Context information may be used to improve or enhance speaker identification, such as by determining or narrowing a set of potential speakers based on the current location of the user.
At block 11.4701, the process performs receiving an indication of a location of the user.
At block 11.4702, the process performs determining a plurality of persons with whom the user commonly interacts at the location. For example, if the indicated location is a workplace, the process may generate a list of co-workers, thereby reducing or simplifying the problem of speaker identification.
At block 11.4801, the process performs receiving a GPS location from a mobile device of the user.
At block 11.4901, the process performs receiving a network identifier that is associated with the location. The network identifier may be, for example, a service set identifier (“SSID”) of a wireless network with which the user is currently associated.
At block 11.5001, the process performs receiving an indication that the user is at a workplace or a residence. For example, the process may translate a coordinate-based location (e.g., GPS coordinates) to a particular workplace by performing a map lookup or other mechanism.
At block 11.5101, the process performs receiving information about a communication that references the speaker. As noted, context information may include communications. In this case, the process may exploit such communications to improve speaker identification or other operations.
At block 11.5201, the process performs receiving information about a message and/or a document that references the speaker.
At block 11.5301, the process performs identifying a plurality of candidate speakers. In some embodiments, more than one candidate speaker may be identified, such as by a voice identification process that returns multiple candidate speakers along with associated likelihoods and/or due to ambiguity or uncertainty regarding who is speaking.
At block 11.5302, the process performs presenting indications of the plurality of candidate speakers. The process may display or tell the user about the candidate speakers so that the user can select which one (if any) is the actual speaker.
At block 11.5401, the process performs receiving from the user a selection of one of the plurality of candidate speakers that is the speaker. The user may indicate, such as via a user interface input, a gesture, a spoken command, or the like, which of the plurality of candidate speakers is the actual speaker.
At block 11.5402, the process performs determining the speaker-related information based on the selection received from the user.
At block 11.5501, the process performs receiving from the user an indication that none of the plurality of candidate speakers are the speaker. The user may indicate, such as via a user interface input, a gesture, a spoken command, or the like, that he does not recognize any of the candidate speakers as the actual speaker.
At block 11.5502, the process performs training a speaker identification system based on the received indication. The received indication may in turn be used to train or otherwise improve performance of a speaker identification or recognition system.
At block 11.5601, the process performs training a speaker identification system based on a selection regarding the plurality of candidate speakers received from a user. A selection regarding which speaker is the actual speaker (or that the actual speaker is not recognized amongst the candidate speakers) may be used to train or otherwise improve performance of a speaker identification or recognition system.
At block 11.5701, the process performs developing a corpus of speaker data by recording speech from a plurality of speakers.
At block 11.5702, the process performs determining the speaker-related information and/or translating the utterance based at least in part on the corpus of speaker data. Over time, the process may gather and record speech obtained during its operation, and then use that speech as part of a corpus that is used during future operation. In this manner, the process may improve its performance by utilizing actual, environmental speech data, possibly along with feedback received from the user, as discussed below.
At block 11.5801, the process performs generating a speech model associated with each of the plurality of speakers, based on the recorded speech. The generated speech model may include voice print data that can be used for speaker identification, a language model that may be used for speech recognition purposes, and/or a noise model that may be used to improve operation in speaker-specific noisy environments.
At block 11.5901, the process performs receiving feedback regarding accuracy of the speaker-related information. During or after providing speaker-related information to the user, the user may provide feedback regarding its accuracy. This feedback may then be used to train a speech processor (e.g., a speaker identification module, a speech recognition module). Feedback may be provided in various ways, such as by processing positive/negative utterances from the speaker (e.g., “That is not my name”), receiving a positive/negative utterance from the user (e.g., “I am sorry.”), or receiving a keyboard/button event that indicates a correct or incorrect identification.
At block 11.5902, the process performs training a speech processor based at least in part on the received feedback.
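One possible sketch of folding such feedback into a speaker identification component appears below; recomputing a per-speaker centroid from confirmed and corrected examples is an illustrative simplification of retraining an actual model.

```python
# Sketch: queue confirmed/corrected identifications as labeled examples and
# recompute a simple centroid voice print per speaker.
import numpy as np

class FeedbackTrainer:
    def __init__(self):
        self.examples = {}   # speaker_id -> list of feature vectors

    def add_feedback(self, features: np.ndarray, predicted: str,
                     confirmed: bool, corrected_to: str = None):
        label = predicted if confirmed else corrected_to
        if label is None:
            return                       # negative feedback with no correction
        self.examples.setdefault(label, []).append(features)

    def retrain(self) -> dict:
        """Recompute one centroid voice print per speaker from feedback."""
        return {spk: np.mean(vecs, axis=0) for spk, vecs in self.examples.items()}

trainer = FeedbackTrainer()
trainer.add_feedback(np.array([0.1, 0.9]), predicted="joe", confirmed=False,
                     corrected_to="bob")
trainer.add_feedback(np.array([0.2, 0.8]), predicted="bob", confirmed=True)
print(trainer.retrain())   # -> {'bob': array([0.15, 0.85])}
```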
At block 11.6001, the process performs transmitting the message in the second language from a first device to a second device. In some embodiments, at least some of the processing may be performed on distinct devices, resulting in a transmission of the translated utterance from one device to another device.
At block 11.6101, the process performs wirelessly transmitting the message in the second language. Various protocols may be used, including Bluetooth, infrared, WiFi, or the like.
At block 11.6201, the process performs transmitting the message in the second language from a smart phone or portable media device to the second device. For example, a smart phone may forward the translated utterance to a desktop computing system for display on an associated monitor.
At block 11.6301, the process performs transmitting the message in the second language from a server system to the second device. In some embodiments, some portion of the processing is performed on a server system that may be remote from the hearing device or the second device.
At block 11.6401, the process performs transmitting the message in the second language from a server system that resides in a data center.
At block 11.6501, the process performs transmitting the message in the second language from a server system to a desktop computer of the user.
At block 11.6601, the process performs transmitting the message in the second language from a server system to a mobile device of the user.
At block 11.6701, the process performs performing the receiving data representing a speech signal, the determining speaker-related information, the translating the utterance in the first language into a message in a second language, and/or the presenting the message in the second language on a mobile device that is operated by the user. As noted, in some embodiments a mobile device such as a smart phone or media player may have sufficient processing power to perform a portion of the process, such as identifying the speaker, determining the speaker-related information, or the like.
At block 11.6801, the process performs performing the receiving data representing a speech signal, the determining speaker-related information, the translating the utterance in the first language into a message in a second language, and/or the presenting the message in the second language on a desktop computer that is operated by the user. For example, in an office setting, the user's desktop computer may be configured to perform some or all of the process.
At block 11.6901, the process performs determining to perform at least some of determining speaker-related information or translating the utterance in the first language into a message in a second language on another computing device that has available processing capacity. In some embodiments, the process may determine to offload some of its processing to another computing device or system.
At block 11.7001, the process performs receiving at least some of the speaker-related information from the other computing device. The process may receive the speaker-related information or a portion thereof from the other computing device.
At block 11.7101, the process performs informing the user of the speaker-related information. The process may also inform the user of the speaker-related information, so that the user can utilize the information in his conversation with the speaker, or for other reasons.
At block 11.7201, the process performs receiving feedback from the user regarding correctness of the speaker-related information. The user may notify the process when the speaker-related information is incorrect or inaccurate, such as when the process has misidentified the speaker's language or name.
At block 11.7202, the process performs refining the speaker-related information based on the received feedback. The received feedback may be used to train or otherwise improve the performance of the AEFS.
At block 11.7301, the process performs presenting speaker-related information corresponding to each of multiple likely speakers.
At block 11.7302, the process performs receiving from the user an indication that the speaker is one of the multiple likely speakers.
At block 11.7401, the process performs presenting the speaker-related information on a display of the hearing device. In some embodiments, the hearing device may include a display. For example, where the hearing device is a smart phone or media device, the hearing device may include a display that provides a suitable medium for presenting the name or other identifier of the speaker.
At block 11.7501, the process performs presenting the speaker-related information on a display of a computing device that is distinct from the hearing device. In some embodiments, the hearing device may not itself include a display. For example, where the hearing device is an office phone, the process may elect to present the speaker-related information on a display of a nearby computing device, such as a desktop or laptop computer in the vicinity of the phone.
At block 11.7601, the process performs audibly informing the user to view the speaker-related information on a display device.
At block 11.7701, the process performs playing a tone via an audio speaker of the hearing device. The tone may include a beep, chime, or other type of notification.
At block 11.7801, the process performs playing synthesized speech via an audio speaker of the hearing device, the synthesized speech telling the user to view the display device. In some embodiments, the process may perform text-to-speech processing to generate audio of a textual message or notification, and this audio may then be played or otherwise output to the user via the hearing device.
At block 11.7901, the process performs telling the user that at least one of a document, a calendar event, and/or a communication is available for viewing on the display device. Telling the user about a document or other speaker-related information may include playing synthesized speech that includes an utterance to that effect.
At block 11.8001, the process performs audibly informing the user in a manner that is not audible to the speaker. For example, a tone or verbal message may be output via an earpiece speaker, such that other parties to the conversation (including the speaker) do not hear the notification. As another example, a tone or other notification may be output into the earpiece of a telephone, such as when the process is performing its functions within the context of a telephonic conference call.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 9.100. In addition, the computing system 12.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 9.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 12.400 comprises a computer memory (“memory”) 12.401, a display 12.402, one or more Central Processing Units (“CPU”) 12.403, Input/Output devices 12.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 12.405, and network connections 12.406. The AEFS 9.100 is shown residing in memory 12.401. In other embodiments, some portion of the contents, some or all of the components of the AEFS 9.100 may be stored on and/or transmitted over the other computer-readable media 12.405. The components of the AEFS 9.100 preferably execute on one or more CPUs 12.403 and perform ability enhancement functions, as described herein. Other code or programs 12.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 12.420, also reside in the memory 12.401, and preferably execute on one or more CPUs 12.403. Of note, one or more of the components in
The AEFS 9.100 interacts via the network 12.450 with hearing devices 9.120, speaker-related information sources 9.130, and third-party systems/applications 12.455. The network 12.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 12.455 may include any systems that provide data to, or utilize data from, the AEFS 9.100, including Web browsers, e-commerce sites, calendar applications, email systems, social networking services, and the like.
The AEFS 9.100 is shown executing in the memory 12.401 of the computing system 12.400. Also included in the memory are a user interface manager 12.415 and an application program interface (“API”) 12.416. The user interface manager 12.415 and the API 12.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 9.100.
The UI manager 12.415 provides a view and a controller that facilitate user interaction with the AEFS 9.100 and its various components. For example, the UI manager 12.415 may provide interactive access to the AEFS 9.100, such that users can configure the operation of the AEFS 9.100, such as by providing the AEFS 9.100 credentials to access various sources of speaker-related information, including social networking services, email systems, document stores, or the like. In some embodiments, access to the functionality of the UI manager 12.415 may be provided via a Web server, possibly executing as one of the other programs 12.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 12.455 can interact with the AEFS 9.100 via the UI manager 12.415.
The API 12.416 provides programmatic access to one or more functions of the AEFS 9.100. For example, the API 12.416 may provide a programmatic interface to one or more functions of the AEFS 9.100 that may be invoked by one of the other programs 12.430 or some other module. In this manner, the API 12.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 9.100 into Web applications), and the like.
In addition, the API 12.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the hearing devices 9.120, information sources 9.130, and/or one of the third-party systems/applications 12.455, to access various functions of the AEFS 9.100. For example, an information source 9.130 may push speaker-related information (e.g., emails, documents, calendar events) to the AEFS 9.100 via the API 12.416. The API 12.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 12.455 and that are configured to interact with the AEFS 9.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
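Purely as an illustration, an information source pushing speaker-related information might issue an HTTP request such as the following; the endpoint URL and JSON schema are hypothetical, as the disclosure does not specify a wire format for the API 12.416.

```python
# Sketch: POST a JSON payload of speaker-related information to a hypothetical
# AEFS endpoint using only the Python standard library.
import json
import urllib.request

def push_speaker_information(endpoint: str, payload: dict) -> int:
    """Send speaker-related information and return the HTTP status code."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# Example call (requires a reachable endpoint; URL and schema are illustrative):
# push_speaker_information(
#     "https://aefs.example.com/api/speaker-info",
#     {"speaker": "Joe Smith",
#      "items": [{"type": "email", "subject": "Project review"}]})
```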
In an example embodiment, components/modules of the AEFS 9.100 are implemented using standard programming techniques. For example, the AEFS 9.100 may be implemented as a “native” executable running on the CPU 12.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 9.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 12.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 9.100, such as in the data store 12.420 (or 10.240), can be made available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through markup or query languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 12.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 9.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for enhanced voice conferencing and, more particularly, for voice conferencing enhanced by presenting speaker-related information determined based at least in part on speaker utterances. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). The AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities of a user, such as by determining and presenting speaker-related information to participants in a conference call. For example, when multiple speakers engage in a voice conference (e.g., a telephone conference), the AEFS may “listen” to the voice conference in order to determine speaker-related information, such as identifying information (e.g., name, title) about the current speaker (or some other speaker) and/or events/communications relating to the current speaker and/or to the subject matter of the conference call generally. Then, the AEFS may inform a user (typically one of the participants in the voice conference) of the determined information, such as by presenting the information via a conferencing device (e.g., smart phone, laptop, desktop telephone) associated with the user. The user can then receive the information (e.g., by reading or hearing it via the conferencing device) provided by the AEFS and advantageously use that information to avoid embarrassment (e.g., due to an inability to identify the speaker), engage in a more productive conversation (e.g., by quickly accessing information about events, deadlines, or communications related to the speaker), or the like.
In some embodiments, the AEFS is configured to receive data that represents speech signals from a voice conference amongst multiple speakers. The multiple speakers may be remotely located from one another, such as by being in different rooms within a building, by being in different buildings within a site or campus, by being in different cities, or the like. Typically, the multiple speakers are each using a conferencing device, such as a land-line telephone, cell phone, smart phone, computer, or the like, to communicate with one another. The AEFS may obtain the data that represents the speech signals from one or more of the conferencing devices and/or from some intermediary point, such as a conference call facility, chat system, videoconferencing system, PBX, or the like. The AEFS may then determine voice conference-related information, including speaker-related information associated with one or more of the speakers. Determining speaker-related information may include identifying the speaker based at least in part on the received data, such as by performing speaker recognition and/or speech recognition with the received data. Determining speaker-related information may also or instead include determining an identifier (e.g., name or title) of the speaker, an information item (e.g., a document, event, communication) that references the speaker, or the like. Then, the AEFS may inform a user of the determined speaker-related information by, for example, visually presenting the speaker-related information via a display screen of a conferencing device associated with the user. In other embodiments, some other display may be used, such as a screen on a laptop computer that is being used by the user while the user is engaged in the voice conference via a telephone. In some embodiments, the AEFS may inform the user in an audible manner, such as by “speaking” the determined speaker-related information via an audio speaker of the conferencing device.
In some embodiments, the AEFS may perform other services, including translating utterances made by speakers in a voice conference, so that a multi-lingual voice conference may be facilitated even when some speakers do not understand the language used by other speakers. In such cases, the determined speaker-related information may be used to enhance or augment language translation and/or related processes, including speech recognition, natural language processing, and the like.
The AEFS 13.100 and the conferencing devices 13.120 are communicatively coupled to one another via the communication system 13.150. The AEFS 13.100 is also communicatively coupled to speaker-related information sources 13.130, including messages 13.130a, documents 13.130b, and audio data 13.130c. The AEFS 13.100 uses the information in the information sources 13.130, in conjunction with data received from the conferencing devices 13.120, to determine information related to the voice conference, including speaker-related information associated with the speakers 13.102.
In the illustrated scenario, the AEFS 13.100 receives data representing a speech signal that represents the utterance 13.110, such as by receiving a digital representation of an audio signal transmitted by conferencing device 13.120b. The data representing the speech signal may include audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal. The AEFS 13.100 may receive the data in various ways, including from one or more of the conferencing devices or from some intermediate system (e.g., a voice conferencing system that is facilitating the conference between the conferencing devices 13.120).
The AEFS 13.100 then determines speaker-related information associated with the speaker 13.102b. Determining speaker-related information may include identifying the speaker 13.102b based on the received data representing the speech signal. In some embodiments, identifying the speaker may include performing speaker recognition, such as by generating a “voice print” from the received data and comparing the generated voice print to previously obtained voice prints. For example, the generated voice print may be compared to multiple voice prints that are stored as audio data 13.130c and that each correspond to a speaker, in order to determine a speaker who has a voice that most closely matches the voice of the speaker 13.102b. The voice prints stored as audio data 13.130c may be generated based on various sources of data, including data corresponding to speakers previously identified by the AEFS 13.100, voice mail messages, speaker enrollment data, or the like.
In some embodiments, identifying the speaker 13.102b may include performing speech recognition, such as by automatically converting the received data representing the speech signal into text. The text of the speaker's utterance may then be used to identify the speaker 13.102b. In particular, the text may identify one or more entities such as information items (e.g., communications, documents), events (e.g., meetings, deadlines), persons, or the like, that may be used by the AEFS 13.100 to identify the speaker 13.102b. The information items may be accessed with reference to the messages 13.130a and/or documents 13.130b. As one example, the speaker's utterance 13.110 may identify an email message that was sent to the speaker 13.102b and possibly others (e.g., “That sure was a nasty email Bob sent”). As another example, the speaker's utterance 13.110 may identify a meeting or other event to which the speaker 13.102b and possibly others are invited.
Note that in some cases, the text of the speaker's utterance 13.110 may not definitively identify the speaker 13.102b, such as because the speaker 13.102b has not previously met or communicated with other participants in the voice conference or because a communication was sent to recipients in addition to the speaker 13.102b. In such cases, there may be some ambiguity as to the identity of the speaker 13.102b. However, even then, a preliminary identification of multiple candidate speakers may still be used by the AEFS 13.100 to narrow the set of potential speakers, and may be combined with (or used to improve) other techniques, including speaker recognition as discussed above. In addition, even if the speaker 13.102b is unknown to the user 13.102a, the AEFS 13.100 may still determine useful demographic or other speaker-related information that may be fruitfully employed for speech recognition or other purposes.
Note also that speaker-related information need not definitively identify the speaker. In particular, it may also or instead be or include other information about or related to the speaker, such as demographic information including the gender of the speaker 13.102, his country or region of origin, the language(s) spoken by the speaker 13.102, or the like. Speaker-related information may include an organization that includes the speaker (along with possibly other persons, such as a company or firm), an information item that references the speaker (and possibly other persons), an event involving the speaker, or the like. The speaker-related information may generally be determined with reference to the messages 13.130a, documents 13.130b, and/or audio data 13.130c. For example, having determined the identity of the speaker 13.102, the AEFS 13.100 may search for emails and/or documents that are stored as messages 13.130a and/or documents 13.130b and that reference (e.g., are sent to, are authored by, are named in) the speaker 13.102.
Other types of speaker-related information are contemplated, including social networking information, such as personal or professional relationship graphs represented by a social networking service, messages or status updates sent within a social network, or the like. Social networking information may also be derived from other sources, including email lists, contact lists, communication patterns (e.g., frequent recipients of emails), or the like.
The AEFS 13.100 then informs the user (speaker 13.102a) of the determined speaker-related information. Informing the user may include audibly presenting the information to the user via an audio speaker of the conferencing device 13.120a. In this example, the conferencing device 13.120a tells the user, such as by playing audio via an earpiece or in another manner that cannot be detected by the other participants in the voice conference, that speaker 13.102b is currently speaking. In particular, the conferencing device 13.120a plays audio that includes the utterance “Bill speaking” to the user.
Informing the user of the determined speaker-related information may also or instead include visually presenting the information, such as via the display 13.121 or audio speaker of conferencing device 13.120a. In the illustrated example, the AEFS 13.100 causes a message 13.112 that includes text of an email from Bill (speaker 13.102b) to be displayed on the display 13.121. In this example, the displayed email includes a statement from Bill (speaker 13.102b) that sets the project deadline to next week, not tomorrow. Upon reading the message 13.112 and thereby learning the actual project deadline, the speaker 13.102a responds to the original utterance 13.110 of speaker 13.102b (Bill) with a response utterance 13.114 that includes the words “Not according to your email, Bill.” In the illustrated example, speaker 13.102c, upon hearing the utterance 13.114, responds with an utterance 13.115 that includes the words “I agree with Joe,” indicating his agreement with speaker 13.102a.
As the speakers 13.102a-102c continue to engage in the voice conference, the AEFS 13.100 may monitor the conversation and continue to determine and present speaker-related information at least to the speaker 13.102a. Another example function that may be performed by the AEFS 13.100 includes presenting, as each of the multiple speakers takes a turn speaking during the voice conference, information about the identity of the current speaker. For example, in response to the onset of an utterance of a speaker, the AEFS 13.100 may display the name of the speaker on the display 13.121, so that the user is always informed as to who is speaking.
The AEFS 13.100 may perform other services, including translating utterances made by speakers in the voice conference, so that a multi-lingual voice conference may be conducted even between participants who do not understand all of the languages being spoken. Translating utterances may initially include determining speaker-related information by automatically determining the language that is being used by a current speaker. Determining the language may be based on signal processing techniques that identify signal characteristics unique to particular languages. Determining the language may also or instead be performed by simultaneous or concurrent application of multiple speech recognizers that are each configured to recognize speech in a corresponding language, and then choosing the language corresponding to the recognizer that produces the result having the highest confidence level. Determining the language may also or instead be based on contextual factors, such as GPS information indicating that the current speaker is in Germany, Austria, or some other region where German is commonly spoken.
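As a non-limiting illustration of the multiple-recognizer approach described above, the following sketch (in Python) selects the language whose recognizer reports the highest confidence for a given utterance. The per-language recognizer callables and the confidence scale are assumptions introduced purely for illustration and do not correspond to any particular speech recognition toolkit.

```python
# Minimal sketch: choose the spoken language by running several
# per-language recognizers and keeping the highest-confidence result.
# The recognizer callables below are hypothetical placeholders; a real
# deployment would wrap an actual speech recognition engine per language.

from typing import Callable, Dict, Tuple

# Each recognizer maps raw audio bytes to (transcript, confidence in [0, 1]).
Recognizer = Callable[[bytes], Tuple[str, float]]

def identify_language(audio: bytes,
                      recognizers: Dict[str, Recognizer]) -> Tuple[str, str, float]:
    """Return (language_code, transcript, confidence) for the best recognizer."""
    best_lang, best_text, best_conf = "", "", -1.0
    for lang, recognize in recognizers.items():
        text, conf = recognize(audio)          # run the recognizer for this language
        if conf > best_conf:                   # keep the most confident hypothesis
            best_lang, best_text, best_conf = lang, text, conf
    return best_lang, best_text, best_conf

# Example wiring with dummy recognizers standing in for real engines.
if __name__ == "__main__":
    dummy = {
        "en": lambda audio: ("where is the train station", 0.91),
        "de": lambda audio: ("wo ist der bahnhof", 0.42),
    }
    print(identify_language(b"\x00\x01", dummy))  # ('en', 'where is the train station', 0.91)
```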
Having determined speaker-related information, the AEFS 13.100 may then translate an utterance in a first language into an utterance in a second language. In some embodiments, the AEFS 13.100 translates an utterance by first performing speech recognition to translate the utterance into a textual representation that includes a sequence of words in the first language. Then, the AEFS 13.100 may translate the text in the first language into a message in a second language, using machine translation techniques. Speech recognition and/or machine translation may be modified, enhanced, and/or otherwise adapted based on the speaker-related information. For example, a speech recognizer may use speech or language models tailored to the speaker's gender, accent/dialect (e.g., determined based on country/region of origin), social class, or the like. As another example, a lexicon that is specific to the speaker may be used during speech recognition and/or language translation. Such a lexicon may be determined based on prior communications of the speaker, profession of the speaker (e.g., engineer, attorney, doctor), or the like.
Once the AEFS 13.100 has translated an utterance in a first language into a message in a second language, the AEFS 13.100 can present the message in the second language. Various techniques are contemplated. In one approach, the AEFS 13.100 causes the conferencing device 13.120a (or some other device accessible to the user) to visually display the message on the display 13.121. In another approach, the AEFS 13.100 causes the conferencing device 13.120a (or some other device) to “speak” or “tell” the user/speaker 13.102a the message in the second language. Presenting a message in this manner may include converting a textual representation of the message into audio via text-to-speech processing (e.g., speech synthesis), and then presenting the audio via an audio speaker (e.g., earphone, earpiece, earbud) of the conferencing device 13.120a.
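The recognize-translate-present chain described above may be sketched, under similar assumptions, as follows. The helper functions are trivial placeholders rather than a real recognizer, translator, or synthesizer; they serve only to show where speaker-related information (here, a hypothetical per-speaker lexicon) and the choice between visible and audible presentation enter the flow.

```python
# Minimal sketch of the recognize -> translate -> present chain. The helper
# functions are placeholders (assumptions), not a real recognizer, machine
# translator, or speech synthesizer.

from typing import Dict, Optional

def speech_to_text(audio: bytes, lang: str, speaker_info: Optional[Dict]) -> str:
    return "wo ist der bahnhof"                 # placeholder transcript

def translate_text(text: str, src: str, dst: str, lexicon: Optional[Dict]) -> str:
    lexicon = lexicon or {}
    # Apply speaker-specific word mappings; a real system would use full MT.
    return " ".join(lexicon.get(w, w) for w in text.split())

def present(message: str, display_available: bool) -> None:
    if display_available:
        print(f"[display] {message}")           # visual presentation
    else:
        print(f"[earpiece] (synthesized) {message}")  # audible presentation

if __name__ == "__main__":
    info = {"lexicon": {"wo": "where", "ist": "is", "der": "the", "bahnhof": "train station"}}
    text = speech_to_text(b"...", "de", info)
    present(translate_text(text, "de", "en", info["lexicon"]), display_available=True)
```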
As an initial matter, note that the AEFS 13.100 may use output devices of a conferencing device or other devices to present information to a user, such as speaker-related information that may generally assist the user in engaging in a voice conference with other participants. For example, the AEFS 13.100 may present speaker-related information about a current speaker, such as his name, title, communications that reference or are related to the speaker, and the like.
For audio output, each of the illustrated conferencing devices 13.120 may include or be communicatively coupled to an audio speaker operable to generate and output audio signals that may be perceived by the user 13.102. As discussed above, the AEFS 13.100 may use such a speaker to provide speaker-related information to the user 13.102. The AEFS 13.100 may also or instead audibly notify, via a speaker of a conferencing device 13.120, the user 13.102 to view speaker-related information displayed on the conferencing device 13.120. For example, the AEFS 13.100 may cause a tone (e.g., beep, chime) to be played via the earpiece of the telephone 13.120f. Such a tone may then be recognized by the user 13.102, who will in response attend to information displayed on the display 13.121c. Such audible notification may be used to identify a display that is being used as a current display, such as when multiple displays are being used. For example, different first and second tones may be used to direct the user's attention to the smart phone display 13.121a and laptop display 13.121b, respectively. In some embodiments, audible notification may include playing synthesized speech (e.g., from text-to-speech processing) telling the user 13.102 to view speaker-related information on a particular display device (e.g., “Recent email on your smart phone”).
The AEFS 13.100 may generally cause speaker-related information (or other information including translations) to be presented on various destination output devices. In some embodiments, the AEFS 13.100 may use a display of a conferencing device as a target for displaying information. For example, the AEFS 13.100 may display speaker-related information on the display 13.121a of the smart phone 13.120d. On the other hand, when the conferencing device does not have its own display or if the display is not suitable for displaying the determined information, the AEFS 13.100 may display speaker-related information on some other destination display that is accessible to the user 13.102. For example, when the telephone 13.120f is the conferencing device and the user also has the laptop computer 13.120e in his possession, the AEFS 13.100 may elect to display an email or other substantial document upon the display 13.121b of the laptop computer 13.120e.
The AEFS 13.100 may determine a destination output device for a translation, speaker-related information, or other information. In some embodiments, determining a destination output device may include selecting from one of multiple possible destination displays based on whether a display is capable of displaying all of the information. For example, if the environment is noisy, the AEFS may elect to visually display a translation rather than play it through a speaker. As another example, if the user 13.102 is proximate to a first display that is capable of displaying only text and a second display capable of displaying graphics, the AEFS 13.100 may select the second display when the presented information includes graphics content (e.g., an image). In some embodiments, determining a destination display may include selecting from one of multiple possible destination displays based on the size of each display. For example, a small LCD display (such as may be found on a mobile phone or telephone 13.120f) may be suitable for displaying a message that is just a few characters (e.g., a name or greeting) but not be suitable for displaying a longer message or a large document. Note that the AEFS 13.100 may select among multiple potential target output devices even when the conferencing device itself includes its own display and/or speaker.
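One possible, simplified realization of this destination-selection logic is sketched below. The device description (display capacity in characters, graphics capability, audio capability) and the preference for the smallest display that still fits the content are illustrative assumptions, not a required AEFS interface.

```python
# Minimal sketch of destination-device selection based on display capacity,
# graphics capability, and a noisy-environment flag. Field names are
# illustrative assumptions.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OutputDevice:
    name: str
    text_capacity: int        # approximate characters the display can show
    supports_graphics: bool
    has_speaker: bool

def choose_destination(devices: List[OutputDevice],
                       content_length: int,
                       needs_graphics: bool,
                       noisy_environment: bool) -> Optional[OutputDevice]:
    """Pick a device that can carry the content; prefer displays when noisy."""
    candidates = [d for d in devices
                  if d.text_capacity >= content_length
                  and (d.supports_graphics or not needs_graphics)]
    if candidates:
        # Prefer the smallest display that still fits the content.
        return min(candidates, key=lambda d: d.text_capacity)
    if not noisy_environment:
        # Fall back to an audio-capable device when no display fits.
        return next((d for d in devices if d.has_speaker), None)
    return None

if __name__ == "__main__":
    phone_lcd = OutputDevice("office phone LCD", 32, False, True)
    laptop = OutputDevice("laptop display", 4000, True, True)
    chosen = choose_destination([phone_lcd, laptop], content_length=12,
                                needs_graphics=False, noisy_environment=True)
    print(chosen.name)   # office phone LCD: short text fits the small display
```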
Determining a destination output device may be based on other or additional factors. In some embodiments, the AEFS 13.100 may use user preferences that have been inferred (e.g., based on current or prior interactions with the user 13.102) and/or explicitly provided by the user. For example, the AEFS 13.100 may determine to present a translation, an email, or other speaker-related information onto the display 13.121a of the smart phone 13.120d based on the fact that the user 13.102 is currently interacting with the smart phone 13.120d.
Note that although the AEFS 13.100 is shown as being separate from a conferencing device 13.120, some or all of the functions of the AEFS 13.100 may be performed within or by the conferencing device 13.120 itself. For example, the smart phone conferencing device 13.120d and/or the laptop computer conferencing device 13.120e may have sufficient processing power to perform all or some functions of the AEFS 13.100, including one or more of speaker identification, determining speaker-related information, speaker recognition, speech recognition, language translation, presenting information, or the like. In some embodiments, the conferencing device 13.120 includes logic to determine where to perform various processing tasks, so as to advantageously distribute processing between available resources, including that of the conferencing device 13.120, other nearby devices (e.g., a laptop or other computing device of the user 13.102), remote devices (e.g., “cloud-based” processing and/or storage), and the like.
Other types of conferencing devices and/or organizations are contemplated. In some embodiments, the conferencing device may be a “thin” device, in that it may serve primarily as an output device for the AEFS 13.100. For example, an analog telephone may still serve as a conferencing device, with the AEFS 13.100 presenting speaker-related information via the earpiece of the telephone. As another example, a conferencing device may be or be part of a desktop computer, PDA, tablet computer, or the like.
The speech and language engine 14.210 includes a speech recognizer 14.212, a speaker recognizer 14.214, a natural language processor 14.216, and a language translation processor 14.218. The speech recognizer 14.212 transforms speech audio data received (e.g., from the conferencing device 13.120) into textual representation of an utterance represented by the speech audio data. In some embodiments, the performance of the speech recognizer 14.212 may be improved or augmented by use of a language model (e.g., representing likelihoods of transitions between words, such as based on n-grams) or speech model (e.g., representing acoustic properties of a speaker's voice) that is tailored to or based on an identified speaker. For example, once a speaker has been identified, the speech recognizer 14.212 may use a language model that was previously generated based on a corpus of communications and other information items authored by the identified speaker. A speaker-specific language model may be generated based on a corpus of documents and/or messages authored by a speaker. Speaker-specific speech models may be used to account for accents or channel properties (e.g., due to environmental factors or communication equipment) that are specific to a particular speaker, and may be generated based on a corpus of recorded speech from the speaker. In some embodiments, multiple speech recognizers are present, each one configured to recognize speech in a different language.
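To illustrate how a speaker-specific language model might be built and applied, the following sketch trains an add-one-smoothed bigram model from a small corpus attributed to a speaker and uses it to rescore competing recognizer hypotheses. The corpus, the smoothing choice, and the hypothesis list are illustrative only; a production speech recognizer would integrate such a model quite differently.

```python
# Minimal sketch of a speaker-specific bigram language model built from a
# corpus of the speaker's documents and used to rescore recognizer
# hypotheses. Illustrative only, not a production model.

import math
from collections import Counter
from typing import List, Tuple

def train_bigram_model(corpus: List[str]) -> Tuple[Counter, Counter, int]:
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams, len(unigrams)

def score(sentence: str, model: Tuple[Counter, Counter, int]) -> float:
    unigrams, bigrams, vocab = model
    words = sentence.lower().split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothed bigram probability P(cur | prev).
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp

if __name__ == "__main__":
    speaker_docs = ["the project deadline is next week",
                    "please review the project plan before the deadline"]
    model = train_bigram_model(speaker_docs)
    hypotheses = ["the project dead line is necks weak",
                  "the project deadline is next week"]
    # The hypothesis better matching the speaker's prior writing scores higher.
    print(max(hypotheses, key=lambda h: score(h, model)))
```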
The speaker recognizer 14.214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the conferencing device 13.120. The speaker recognizer 14.214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 14.240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data.
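A minimal sketch of the voice-print matching step follows, assuming each voice print has already been reduced to a fixed-length feature vector (e.g., averaged cepstral features); the feature values are illustrative, and feature extraction itself is outside the scope of the sketch.

```python
# Minimal sketch of voice-print matching: report the enrolled print with the
# highest cosine similarity to the incoming voice. Feature extraction is
# assumed to happen elsewhere; the vectors here are illustrative.

import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(voice_print: List[float],
               enrolled: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the enrolled speaker whose print is closest to the input."""
    return max(((name, cosine(voice_print, vec)) for name, vec in enrolled.items()),
               key=lambda item: item[1])

if __name__ == "__main__":
    enrolled = {"Bill": [0.9, 0.1, 0.3], "Joe": [0.2, 0.8, 0.5]}
    print(best_match([0.85, 0.15, 0.25], enrolled))   # ('Bill', 0.99...)
```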
The natural language processor 14.216 processes text generated by the speech recognizer 14.212 and/or located in information items obtained from the speaker-related information sources 13.130. In doing so, the natural language processor 14.216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification, language translation, and/or other functions of the AEFS 13.100. For example, the natural language processor 14.216 may process status updates posted by the user 13.102a on a social networking service, to determine that the user 13.102a recently attended a conference in a particular city, and this fact may be used to identify a speaker and/or determine other speaker-related information, which may in turn be used for language translation or other functions.
The language translation processor 14.218 translates from one language to another, for example, by converting text in a first language to text in a second language. The text input to the language translation processor 14.218 may be obtained from, for example, the speech recognizer 14.212 and/or the natural language processor 14.216. The language translation processor 14.218 may use speaker-related information to improve or adapt its performance. For example, the language translation processor 14.218 may use a lexicon or vocabulary that is tailored to the speaker, such as may be based on the speaker's country/region of origin, the speaker's social class, the speaker's profession, or the like.
The agent logic 14.220 implements the core intelligence of the AEFS 13.100. The agent logic 14.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to identify speakers, determine speaker-related information, and the like. For example, the agent logic 14.220 may combine spoken text from the speech recognizer 14.212, a set of potentially matching (candidate) speakers from the speaker recognizer 14.214, and information items from the information sources 13.130, in order to determine a most likely identity of the current speaker. As another example, the agent logic 14.220 may identify the language spoken by the speaker by analyzing the output of multiple speech recognizers that are each configured to recognize speech in a different language, to identify the language of the speech recognizer that returns the highest confidence result as the spoken language.
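By way of a non-limiting example, the evidence-combining step may be sketched as a weighted sum over candidate scores from several sources, as shown below. The particular sources, weights, and score scales are assumptions chosen for illustration; a reasoning engine as described above could equally use rules, decision trees, or Bayesian inference.

```python
# Minimal sketch of evidence combination: candidate speakers arrive with
# scores from several sources (voice match, mentions in recognized text,
# references in information items); a weighted sum picks the most likely
# identity. Weights and scales are illustrative assumptions.

from typing import Dict

def combine_evidence(voice_scores: Dict[str, float],
                     text_mentions: Dict[str, float],
                     info_item_hits: Dict[str, float],
                     weights=(0.6, 0.25, 0.15)) -> str:
    """Return the candidate with the highest weighted combined score."""
    candidates = set(voice_scores) | set(text_mentions) | set(info_item_hits)
    def total(name: str) -> float:
        return (weights[0] * voice_scores.get(name, 0.0) +
                weights[1] * text_mentions.get(name, 0.0) +
                weights[2] * info_item_hits.get(name, 0.0))
    return max(candidates, key=total)

if __name__ == "__main__":
    # Prints "Bob": the mention in the recognized text tips the balance.
    print(combine_evidence(voice_scores={"Bill": 0.7, "Bob": 0.65},
                           text_mentions={"Bob": 1.0},
                           info_item_hits={"Bill": 1.0, "Bob": 0.2}))
```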
The presentation engine 14.230 includes a visible output processor 14.232 and an audible output processor 14.234. The visible output processor 14.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the conferencing device 13.120 or some other display (e.g., a desktop or laptop display in proximity to the user 13.102a). The agent logic 14.220 may use or invoke the visible output processor 14.232 to prepare and display information, such as by formatting or otherwise modifying a translation or some speaker-related information to fit on a particular type or size of display. The audible output processor 14.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 14.220 may use or invoke the audible output processor 14.234 in order to convert a textual message (e.g., including or referencing speaker-related information) into audio output suitable for presentation via the conferencing device 13.120, for example by employing a text-to-speech processor.
Note that although speaker identification and/or determining speaker-related information is herein sometimes described as including the positive identification of a single speaker, it may instead or also include determining likelihoods that each of one or more persons is the current speaker. For example, the speaker recognizer 14.214 may provide to the agent logic 14.220 indications of multiple candidate speakers, each having a corresponding likelihood or confidence level. The agent logic 14.220 may then select the most likely candidate based on the likelihoods alone or in combination with other information, such as that provided by the speech recognizer 14.212, natural language processor 14.216, speaker-related information sources 13.130, or the like. In some cases, such as when there are a small number of reasonably likely candidate speakers, the agent logic 14.220 may inform the user 13.102a of the identities of all of the candidate speakers (as opposed to a single candidate speaker), as such information may be sufficient to trigger the user's recall and enable the user to make a selection that informs the agent logic 14.220 of the speaker's identity.
Note that in some embodiments, one or more of the illustrated components, or components of different types, may be included or excluded. For example, in one embodiment, the AEFS 13.100 does not include the language translation processor 14.218.
FIGS. 15.1-15.108 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 15.101, the process performs receiving data representing speech signals from a voice conference amongst multiple speakers, wherein the multiple speakers include at least three speakers. The voice conference may be, for example, taking place between multiple speakers who are engaged in a conference call. The received data may be or represent one or more speech signals (e.g., audio samples) and/or higher-order information (e.g., frequency coefficients). The data may be received by or at the conferencing device 13.120 and/or the AEFS 13.100.
At block 15.102, the process performs determining speaker-related information associated with each of the multiple speakers, based on the data representing speech signals from the voice conference. The speaker-related information may include identifiers of a speaker (e.g., names, titles) and/or related information, such as documents, emails, calendar events, or the like. The speaker-related information may also or instead include demographic information about a speaker, including gender, language spoken, country of origin, region of origin, or the like. The speaker-related information may be determined based on signal properties of speech signals (e.g., a voice print) and/or on the semantic content of the speech signal, such as a name, event, entity, or information item that was mentioned by a speaker.
At block 15.103, the process performs presenting the speaker-related information via a conferencing device associated with a user. The speaker-related information may be presented on a display of the conferencing device (if it has one) or on some other display, such as a laptop or desktop display that is proximately located to the user. The speaker-related information may be presented in an audible and/or visible manner.
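The three blocks above may be strung together as in the following sketch, in which the conference feed, the channel-to-speaker directory, and the presentation step are placeholders standing in for the components described elsewhere herein.

```python
# Minimal sketch of blocks 15.101-15.103: receive conference audio, determine
# speaker-related information, and present it via the user's conferencing
# device. The helper functions are placeholders.

from typing import Dict, List

def receive_speech_data(conference_feed: List[Dict]) -> List[Dict]:
    # Block 15.101: each item pairs a channel/speaker hint with audio data.
    return conference_feed

def determine_speaker_info(segment: Dict) -> Dict:
    # Block 15.102: placeholder lookup; a real system would use speaker
    # recognition, speech recognition, and the information sources 13.130.
    directory = {"channel-2": {"name": "Bill", "title": "Project lead"}}
    return directory.get(segment["channel"], {"name": "Unknown"})

def present_to_user(info: Dict) -> None:
    # Block 15.103: visible and/or audible presentation on the user's device.
    print(f"Now speaking: {info.get('name')} ({info.get('title', 'n/a')})")

if __name__ == "__main__":
    feed = [{"channel": "channel-2", "audio": b"..."}]
    for segment in receive_speech_data(feed):
        present_to_user(determine_speaker_info(segment))
```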
At block 15.201, the process performs receiving data representing speech signals from a voice conference amongst multiple speakers, wherein the multiple speakers are remotely located from one another. Two speakers may be remotely located from one another even though they are in the same building or at the same site (e.g., campus, cluster of buildings), such as when the speakers are in different rooms, cubicles, or other locations within the site or building. In other cases, two speakers may be remotely located from one another by being in different cities, states, regions, or the like.
At block 15.301, the process performs as each of the multiple speakers takes a turn speaking during the voice conference, presenting speaker-related information associated with the speaker. The process may, in substantially real time, provide the user with speaker-related information associated with a current speaker, such as a name of the speaker, a message sent by the speaker, or the like. The presented information may be updated throughout the voice conference based on the identity of the current speaker. For example, the process may present the three most recent emails sent by the current speaker.
At block 15.401, the process performs in response to one of the speakers beginning to speak during the voice conference, presenting the speaker-related information associated with the speaker. In some embodiments, the onset of speech may trigger the display or update of speaker-related information. The onset of speech may be detected in various ways, including via endpoint detection and/or frequency analysis.
At block 15.501, the process performs presenting the speaker-related information during a telephone conference call amongst the multiple speakers. In some embodiments, the process operates to facilitate a telephone conference, even when some or all of the speakers are using POTS (plain old telephone service) telephones.
At block 15.601, the process performs presenting, while a current speaker is speaking, speaker-related information on a display device of the user, the displayed speaker-related information identifying the current speaker. For example, as the user engages in a conference call from his office, the process may present the name or other information about the current speaker on a display of a desktop computer in the office of the user.
At block 15.701, the process performs receiving audio data from a telephone conference call that includes the multiple speakers, the received audio data representing utterances made by at least one of the multiple speakers. In some embodiments, the process may function in the context of a telephone conference, such as by receiving audio data from a system that facilitates the telephone conference, including a physical or virtual PBX (private branch exchange), a voice over IP conference system, or the like.
At block 15.801, the process performs receiving audio data from an online audio chat that includes the multiple speakers, the received audio data representing utterances made by at least one of the multiple speakers. In some embodiments, the process may function in the context of an online audio chat, such as may be supported by an online meeting system.
At block 15.901, the process performs receiving audio data from a video conference that includes the multiple speakers, the received audio data representing utterances made by at least one of the multiple speakers. In some embodiments, the process may function in the context of a video conference, such as may be facilitated by a dedicated system, a community of video enabled computing devices communicating via the Internet, or the like.
At block 15.1001, the process performs receiving data representing speech signals from the at least three speakers, the data obtained at the conferencing device. In some embodiments, the process may obtain data from a conferencing device itself. In other cases, the process may obtain the data from an intermediary source or location.
At block 15.1101, the process performs determining which one of the multiple speakers is speaking during a time interval. The process may determine which one of the speakers is currently speaking, even if the identity of the current speaker is not known. Various approaches may be employed, including detecting the source of a speech signal, performing voice identification, or the like.
At block 15.1201, the process performs associating a first portion of the received data with a first one of the multiple speakers. The process may correspond, bind, link, or similarly associate a portion of the received data with a speaker. Such an association may then be used for further processing, such as voice identification, speech recognition, or the like.
At block 15.1301, the process performs receiving the first portion of the received data along with an identifier associated with the first speaker. In some embodiments, the process may receive data along with an identifier, such as an IP address (e.g., in a voice over IP conferencing system).
At block 15.1401, the process performs receiving a network identifier associated with the first speaker.
At block 15.1501, the process performs receiving from a conferencing system the identifier associated with the first speaker, the conferencing system configured to facilitate a conference call among the multiple speakers. Some conferencing systems may provide an identifier (e.g., telephone number) of a current speaker by detecting which telephone line or other circuit (virtual or physical) has an active signal.
At block 15.1601, the process performs selecting the first portion based on the first portion representing only speech from the one speaker and no other of the multiple speakers. The process may select a portion of the received data based on whether the received data includes speech from only one speaker or from more than one speaker (e.g., when multiple speakers are talking over each other).
At block 15.1701, the process performs determining that two or more of the multiple speakers are speaking concurrently. The process may determine that multiple speakers are talking at the same time, and take action accordingly. For example, the process may elect not to attempt to identify any speaker, or instead identify all of the speakers who are talking out of turn.
At block 15.1801, the process performs performing voice identification to select which one of multiple previously analyzed voices is a best match for the one speaker who is speaking during the time interval. As noted, voice identification may be employed to determine the current speaker.
At block 15.1901, the process performs performing voice identification based on the received data to identify one of the multiple speakers. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints.
At block 15.2001, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple persons. In some embodiments, the process accesses voice prints associated with multiple persons, and determines a best match against the speech signal.
At block 15.2101, the process performs processing voice messages from the multiple persons to generate voice print data for each of the multiple persons. Given a telephone voice message, the process may associate generated voice print data for the voice message with one or more (direct or indirect) identifiers corresponding with the message. For example, the message may have a sender telephone number associated with it, and the process can use that sender telephone number to do a reverse directory lookup (e.g., in a public directory, in a personal contact list) to determine the name of the voice message speaker.
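The voice-message enrollment described above may be sketched as follows, with a plain dictionary standing in for the reverse directory and a placeholder feature extractor standing in for real voice-print generation.

```python
# Minimal sketch of voice-print enrollment from stored voice messages: the
# sender's telephone number attached to each message is resolved to a name
# through a reverse directory, and the resulting voice print is filed under
# that name. make_voice_print is a placeholder feature extractor.

from typing import Dict, List

def make_voice_print(audio: bytes) -> List[float]:
    return [float(len(audio) % 7), 1.0]          # placeholder feature vector

def enroll_from_voicemail(messages: List[Dict],
                          reverse_directory: Dict[str, str]) -> Dict[str, List[float]]:
    enrolled: Dict[str, List[float]] = {}
    for msg in messages:
        name = reverse_directory.get(msg["sender_number"])
        if name:                                  # skip unknown numbers
            enrolled[name] = make_voice_print(msg["audio"])
    return enrolled

if __name__ == "__main__":
    directory = {"+1-555-0102": "Bill"}
    mailbox = [{"sender_number": "+1-555-0102", "audio": b"hello this is bill"}]
    print(enroll_from_voicemail(mailbox, directory))   # {'Bill': [...]}
```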
At block 15.2201, the process performs processing telephone voice messages stored by a voice mail service. In some embodiments, the process analyzes voice messages to generate voice prints/models for multiple persons.
At block 15.2301, the process performs performing speech recognition to convert the received data into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by a speaker.
At block 15.2302, the process performs identifying one of the multiple speakers based on the text data. Given text data (e.g., words spoken by a speaker), the process may search for information items that include the text data, and then identify the one speaker based on those information items, as discussed further below.
At block 15.2401, the process performs finding an information item that references the one speaker and that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item (e.g., email, text message, status update) that includes words spoken by one speaker. Then, the process can infer that the one speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
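As a non-limiting illustration, the following sketch scores stored information items by word overlap with the recognized utterance and derives candidate speakers from the best-matching item's sender and recipients; the item format and the overlap heuristic are assumptions made for the example.

```python
# Minimal sketch of block 15.2401: score stored information items by how many
# words of the recognized utterance they contain, then infer candidate
# speakers from the best-matching item's sender/recipients.

from typing import Dict, List

def find_referencing_item(utterance_text: str, items: List[Dict]) -> Dict:
    words = set(utterance_text.lower().split())
    def overlap(item: Dict) -> int:
        return len(words & set(item["body"].lower().split()))
    return max(items, key=overlap)

if __name__ == "__main__":
    items = [
        {"subject": "deadline", "body": "the project deadline is next week",
         "sender": "Bill", "recipients": ["Joe", "Ann"]},
        {"subject": "lunch", "body": "lunch is at noon tomorrow",
         "sender": "Ann", "recipients": ["Joe"]},
    ]
    best = find_referencing_item("that deadline is not next week", items)
    # Candidate speakers: the author plus the recipients of the matched email.
    print([best["sender"]] + best["recipients"])   # ['Bill', 'Joe', 'Ann']
```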
At block 15.2501, the process performs performing speech recognition based on cepstral coefficients that represent the speech signal. In other embodiments, other types of features or information may also or instead be used to perform speech recognition, including language models, dialect models, or the like.
At block 15.2601, the process performs performing hidden Markov model-based speech recognition. Other approaches or techniques for speech recognition may include neural networks, stochastic modeling, or the like.
At block 15.2701, the process performs retrieving information items that reference the text data. The process may here retrieve or otherwise obtain documents, calendar events, messages, or the like, that include, contain, or otherwise reference some portion of the text data.
At block 15.2702, the process performs informing the user of the retrieved information items.
At block 15.2801, the process performs performing speech recognition based at least in part on a language model associated with the one speaker. A language model may be used to improve or enhance speech recognition. For example, the language model may represent word transition likelihoods (e.g., by way of n-grams) that can be advantageously employed to enhance speech recognition. Furthermore, such a language model may be speaker specific, in that it may be based on communications or other information generated by the one speaker.
At block 15.2901, the process performs generating the language model based on information items generated by the one speaker, the information items including at least one of emails transmitted by the one speaker, documents authored by the one speaker, and/or social network messages transmitted by the one speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like to generate a language model that is specific or otherwise tailored to the one speaker.
At block 15.3001, the process performs generating the language model based on information items generated by or referencing any of the multiple speakers, the information items including emails, documents, and/or social network messages. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like generated by or referencing any of the multiple speakers to generate a language model that is tailored to the current conversation.
At block 15.3101, the process performs receiving data representing a speech signal that represents an utterance of the user. A microphone on or about the conferencing device may capture this data. The microphone may be the same or different from one used to capture speech data from the conversation.
At block 15.3102, the process performs identifying one of the multiple speakers based on the data representing a speech signal that represents an utterance of the user. Identifying the one speaker in this manner may include performing speech recognition on the user's utterance, and then processing the resulting text data to locate a name. This identification can then be utilized to retrieve information items or other speaker-related information that may be useful to present to the user.
At block 15.3201, the process performs determining whether the utterance of the user includes a name of the one speaker.
At block 15.3301, the process performs receiving context information related to the user. Context information may generally include information about the setting, location, occupation, communication, workflow, or other event or factor that is present at, about, or with respect to the user.
At block 15.3302, the process performs determining speaker-related information, based on the context information. Context information may be used to determine speaker-related information, such as by determining or narrowing a set of potential speakers based on the current location of the user.
At block 15.3401, the process performs receiving an indication of a location of the user.
At block 15.3402, the process performs determining a plurality of persons with whom the user commonly interacts at the location. For example, if the indicated location is a workplace, the process may generate a list of co-workers, thereby reducing or simplifying the problem of speaker identification.
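A simple sketch of this narrowing step appears below; the table mapping locations to frequently contacted persons is an illustrative assumption, and the process falls back to the original candidate set when the intersection is empty.

```python
# Minimal sketch of blocks 15.3401-15.3402: map the user's reported location
# to the persons the user commonly interacts with there, and intersect that
# set with the current candidate speakers.

from typing import Dict, List, Set

def narrow_candidates(candidates: Set[str],
                      location: str,
                      frequent_contacts: Dict[str, List[str]]) -> Set[str]:
    known_here = set(frequent_contacts.get(location, []))
    narrowed = candidates & known_here
    return narrowed or candidates      # fall back if the intersection is empty

if __name__ == "__main__":
    contacts = {"workplace": ["Bill", "Ann", "Joe"], "residence": ["Pat"]}
    print(narrow_candidates({"Bill", "Pat"}, "workplace", contacts))  # {'Bill'}
```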
At block 15.3501, the process performs receiving a GPS location from a mobile device of the user.
At block 15.3601, the process performs receiving a network identifier that is associated with the location. The network identifier may be, for example, a service set identifier (“SSID”) of a wireless network with which the user is currently associated.
At block 15.3701, the process performs receiving an indication that the user is at a workplace or a residence. For example, the process may translate a coordinate-based location (e.g., GPS coordinates) to a particular workplace by performing a map lookup or other mechanism.
At block 15.3801, the process performs receiving information about an information item that references one of the multiple speakers. As noted, context information may include information items, such as documents, messages, calendar events, or the like. In this case, the process may exploit such information items to improve speaker identification or other operations.
At block 15.3901, the process performs developing a corpus of speaker data by recording speech from multiple persons.
At block 15.3902, the process performs identifying one of the multiple speakers based at least in part on the corpus of speaker data. Over time, the process may gather and record speech obtained during its operation, and then use that speech as part of a corpus that is used during future operation. In this manner, the process may improve its performance by utilizing actual, environmental speech data, possibly along with feedback received from the user, as discussed below.
At block 15.4001, the process performs generating a speech model associated with each of the multiple persons, based on the recorded speech. The generated speech model may include voice print data that can be used for speaker identification, a language model that may be used for speech recognition purposes, and/or a noise model that may be used to improve operation in noisy environments that are specific to a particular speaker.
At block 15.4101, the process performs receiving feedback regarding accuracy of the speaker-related information. During or after providing speaker-related information to the user, the user may provide feedback regarding its accuracy. This feedback may then be used to train a speech processor (e.g., a speaker identification module, a speech recognition module). Feedback may be provided in various ways, such as by processing positive/negative utterances from a speaker (e.g., "That is not my name"), receiving a positive/negative utterance from the user (e.g., "I am sorry."), or receiving a keyboard/button event that indicates a correct or incorrect identification.
At block 15.4102, the process performs training a speech processor based at least in part on the received feedback.
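One simple, illustrative training strategy is sketched below: when the user's feedback supplies the correct speaker for a misidentified voice print, the stored print for that speaker is updated by a running average. The data layout and the averaging rule are assumptions; many other training approaches (e.g., retraining a classifier) could be used instead.

```python
# Minimal sketch of blocks 15.4101-15.4102: fold a user-corrected voice print
# back into the enrollment data for the correct speaker via a running-average
# update. One simple training strategy among many; data layout is assumed.

from typing import Dict, List

def apply_feedback(enrolled: Dict[str, List[float]],
                   counts: Dict[str, int],
                   voice_print: List[float],
                   correct_name: str) -> None:
    """Update the stored print for correct_name with the new observation."""
    if correct_name not in enrolled:
        enrolled[correct_name] = list(voice_print)
        counts[correct_name] = 1
        return
    n = counts[correct_name]
    enrolled[correct_name] = [(old * n + new) / (n + 1)
                              for old, new in zip(enrolled[correct_name], voice_print)]
    counts[correct_name] = n + 1

if __name__ == "__main__":
    enrolled = {"Bill": [0.9, 0.1]}
    counts = {"Bill": 3}
    apply_feedback(enrolled, counts, [0.7, 0.3], "Bill")   # user confirmed "Bill"
    print(enrolled["Bill"])   # [0.85, 0.15]
```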
At block 15.4201, the process performs presenting the speaker-related information on a display of the conferencing device. In some embodiments, the conferencing device may include a display. For example, where the conferencing device is a smart phone or laptop computer, the conferencing device may include a display that provides a suitable medium for presenting the name or other identifier of the speaker.
At block 15.4301, the process performs presenting the speaker-related information on a display of a computing device that is distinct from the conferencing device. In some embodiments, the conferencing device may not itself include a display. For example, where the conferencing device is an office phone, the process may elect to present the speaker-related information on a display of a nearby computing device, such as a desktop or laptop computer in the vicinity of the phone.
At block 15.4401, the process performs determining a display to serve as a presentation device for the speaker-related information. In some embodiments, there may be multiple displays available as possible destinations for the speaker-related information. For example, in an office setting, where the conferencing device is an office phone, the office phone may include a small LCD display suitable for displaying a few characters or at most a few lines of text. However, there will typically be additional devices in the vicinity of the conferencing device, such as a desktop/laptop computer, a smart phone, a PDA, or the like. The process may determine to use one or more of these other display devices, possibly based on the type of the speaker-related information being displayed.
At block 15.4501, the process performs selecting one display from multiple displays, based at least in part on whether each of the multiple displays is capable of displaying all of the speaker-related information. In some embodiments, the process determines whether all of the speaker-related information can be displayed on a given display. For example, where the display is a small alphanumeric display on an office phone, the process may determine that the display is not capable of displaying a large amount of speaker-related information.
At block 15.4601, the process performs selecting one display from multiple displays, based at least in part on a size of each of the multiple displays. In some embodiments, the process considers the size (e.g., the number of characters or pixels that can be displayed) of each display.
At block 15.4701, the process performs selecting one display from multiple displays, based at least in part on whether each of the multiple displays is suitable for displaying the speaker-related information, the speaker-related information being at least one of text information, a communication, a document, an image, and/or a calendar event. In some embodiments, the process considers the type of the speaker-related information. For example, whereas a small alphanumeric display on an office phone may be suitable for displaying the name of the speaker, it would not be suitable for displaying an email message sent by the speaker.
At block 15.4801, the process performs audibly notifying the user to view the speaker-related information on a display device. In some embodiments, notifying the user may include playing a tone, such as a beep, chime, or other type of notification. In some embodiments, notifying the user may include playing synthesized speech telling the user to view the display device. For example, the process may perform text-to-speech processing to generate audio of a textual message or notification, and this audio may then be played or otherwise output to the user via the conferencing device. In some embodiments, notifying the user may include telling the user that a document, calendar event, communication, or the like is available for viewing on the display device. Telling the user about a document or other speaker-related information may include playing synthesized speech that includes an utterance to that effect. In some embodiments, the process may notify the user in a manner that is not audible to at least some of the multiple speakers. For example, a tone or verbal message may be output via an earpiece speaker, such that other parties to the conversation do not hear the notification. As another example, a tone or other notification may be played into the earpiece of a telephone, such as when the process is performing its functions within the context of a telephonic conference call.
At block 15.4901, the process performs informing the user of an identifier of each of the multiple speakers. In some embodiments, the identifier of each of the speakers may be or include a given name, surname (e.g., last name, family name), nickname, title, job description, or other type of identifier of or associated with the speaker.
At block 15.5001, the process performs informing the user of information aside from identifying information related to the multiple speakers. In some embodiments, information aside from identifying information may include information that is not a name or other identifier (e.g., job title) associated with the speaker. For example, the process may tell the user about an event or communication associated with or related to the speaker.
At block 15.5101, the process performs informing the user of an organization to which each of the multiple speakers belongs. In some embodiments, informing the user of an organization may include notifying the user of a business, group, school, club, team, company, or other formal or informal organization with which a speaker is affiliated. Companies may include profit or non-profit entities, regardless of organizational structure (e.g., corporation, partnerships, sole proprietorship).
At block 15.5201, the process performs informing the user of a previously transmitted communication referencing one of the multiple speakers. Various forms of communication are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, a communication can include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 15.5301, the process performs informing the user of at least one of: an email transmitted between the one speaker and the user and/or a text message transmitted between the one speaker and the user. An email transmitted between the one speaker and the user may include an email sent from the one speaker to the user, or vice versa. Text messages may include short messages according to various protocols, including SMS, MMS, and the like.
At block 15.5401, the process performs informing the user of an event involving the user and one of the multiple speakers. An event may be any occurrence that involves or involved the user and a speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the speaker, an upcoming deadline (e.g., for a project), or the like.
At block 15.5501, the process performs informing the user of a previously occurring event and/or a future event that is at least one of a project, a meeting, and/or a deadline.
At block 15.5601, the process performs accessing information items associated with one of the multiple speakers. In some embodiments, accessing information items associated with one of the multiple speakers may include retrieving files, documents, data records, or the like from various sources, such as local or remote storage devices, cloud-based servers, and the like. In some embodiments, accessing information items may also or instead include scanning, searching, indexing, or otherwise processing information items to find ones that include, name, mention, or otherwise reference a speaker.
At block 15.5701, the process performs searching for information items that reference the one speaker, the information items including at least one of a document, an email, and/or a text message. In some embodiments, searching may include formulating a search query to provide to a document management system or any other data/document store that provides a search interface. In some embodiments, emails or text messages that reference the one speaker may include messages sent from the one speaker, messages sent to the one speaker, messages that name or otherwise identify the one speaker in the body of the message, or the like.
At block 15.5801, the process performs accessing a social networking service to find messages or status updates that reference the one speaker. In some embodiments, accessing a social networking service may include searching for postings, status updates, personal messages, or the like that have been posted by, posted to, or otherwise reference the one speaker. Example social networking services include Facebook, Twitter, Google Plus, and the like. Access to a social networking service may be obtained via an API or similar interface that provides access to social networking data related to the user and/or the one speaker.
At block 15.5901, the process performs accessing a calendar to find information about appointments with the one speaker. In some embodiments, accessing a calendar may include searching a private or shared calendar to locate a meeting or other appointment with the one speaker, and providing such information to the user via the conferencing device.
At block 15.6001, the process performs accessing a document store to find documents that reference the one speaker. In some embodiments, documents that reference the one speaker include those that are authored at least in part by the one speaker, those that name or otherwise identify the speaker in a document body, or the like. Accessing the document store may include accessing a local or remote storage device/system, accessing a document management system, accessing a source control system, or the like.
At block 15.6101, the process performs transmitting the speaker-related information from a first device to a second device having a display. In some embodiments, at least some of the processing may be performed on distinct devices, resulting in a transmission of speaker-related information from one device to another device, for example from a desktop computer to the conferencing device.
At block 15.6201, the process performs wirelessly transmitting the speaker-related information. Various protocols may be used, including Bluetooth, infrared, WiFi, or the like.
At block 15.6301, the process performs transmitting the speaker-related information from a smart phone to the second device. For example, a smart phone may forward the speaker-related information to a desktop computing system for display on an associated monitor.
At block 15.6401, the process performs transmitting the speaker-related information from a server system to the second device. In some embodiments, some portion of the processing is performed on a server system that may be remote from the conferencing device.
At block 15.6501, the process performs transmitting the speaker-related information from a server system that resides in a data center.
At block 15.6601, the process performs transmitting the speaker-related information from a server system to a desktop computer, a laptop computer, a mobile device, or a desktop telephone of the user.
At block 15.6701, the process performs performing the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information, and/or the presenting the speaker-related information on a mobile device that is operated by the user. As noted, in some embodiments a computer or mobile device such as a smart phone may have sufficient processing power to perform a portion of the process, such as identifying a speaker, determining the speaker-related information, or the like.
At block 15.6801, the process performs determining speaker-related information, performed on a smart phone or a media player that is operated by the user.
At block 15.6901, the process performs performing the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information, and/or the presenting the speaker-related information on a desktop computer that is operated by the user. For example, in an office setting, the user's desktop computer may be configured to perform some or all of the process.
At block 15.7001, the process performs determining to perform at least some of determining speaker-related information or presenting the speaker-related information on another computing device that has available processing capacity. In some embodiments, the process may determine to offload some of its processing to another computing device or system.
At block 15.7101, the process performs receiving at least some of the speaker-related information from the another computing device. The process may receive the speaker-related information or a portion thereof from the other computing device.
At block 15.7201, the process performs determining whether or not the user can name one of the multiple speakers.
At block 15.7202, the process performs when it is determined that the user cannot name the one speaker, presenting the speaker-related information. In some embodiments, the process only informs the user of the speaker-related information upon determining that the user does not appear to be able to name a particular speaker.
At block 15.7301, the process performs determining whether the user has named the one speaker. In some embodiments, the process listens to the user to determine whether the user has named the speaker.
At block 15.7401, the process performs determining whether the user has uttered a given name, surname, or nickname of the one speaker.
At block 15.7501, the process performs determining whether the user has uttered a name of a relationship between the user and the one speaker. In some embodiments, the user need not utter the name of the speaker, but instead may utter other information (e.g., a relationship) that may be used by the process to determine that the user knows or can name the speaker.
At block 15.7601, the process performs determining whether the user has uttered information that is related to both the one speaker and the user.
At block 15.7701, the process performs determining whether the user has named a person, place, thing, or event that the one speaker and the user have in common. For example, the user may mention a visit to the home town of the speaker, a vacation to a place familiar to the speaker, or the like.
At block 15.7801, the process performs performing speech recognition to convert an utterance of the user into text data. The process may perform speech recognition on utterances of the user, and then examine the resulting text to determine whether the user has uttered a name or other information about the speaker.
At block 15.7802, the process performs determining whether or not the user can name one of the multiple speakers based at least in part on the text data.
At block 15.7901, the process performs when the user does not name the one speaker within a predetermined time interval, determining that the user cannot name the one speaker. In some embodiments, the process waits for a time period before jumping in to provide the speaker-related information.
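A minimal sketch of this wait-then-assist behavior is shown below; the recognize_next_utterance and present callables, the name-matching heuristic, and the five-second default interval are hypothetical placeholders rather than elements of any particular embodiment.

```python
# Sketch: listen for a bounded interval and present speaker-related
# information only if the user does not name the speaker in that interval.
import time

def user_named_speaker(text, speaker_terms):
    """True if any known name, nickname, or relationship term appears."""
    text = text.lower()
    return any(term.lower() in text for term in speaker_terms)

def maybe_assist(recognize_next_utterance, present, speaker_terms,
                 speaker_info, timeout_s=5.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Hypothetical blocking call that returns recognized text or None.
        text = recognize_next_utterance(timeout=deadline - time.monotonic())
        if text and user_named_speaker(text, speaker_terms):
            return False       # user can name the speaker; stay silent
    present(speaker_info)      # user did not name the speaker in time
    return True
```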
At block 15.8001, the process performs translating an utterance of one of the multiple speakers in a first language into a message in a second language, based on the speaker-related information. In some embodiments, the process may also perform language translation, such that a voice conference may be held between speakers of different languages. In some embodiments, the utterance may be translated by first performing speech recognition on the data representing the speech signal to convert the utterance into textual form. Then, the text of the utterance may be translated into the second language using natural language processing and/or machine translation techniques. The speaker-related information may be used to improve, enhance, or otherwise modify the process of machine translation. For example, based on the identity of the one speaker, the process may use a language or speech model that is tailored to the one speaker in order to improve a machine translation process. As another example, the process may use one or more information items that reference the one speaker to improve machine translation, such as by disambiguating references in the utterance of the one speaker.
At block 15.8002, the process performs presenting the message in the second language. The message may be presented in various ways including using audible output (e.g., via text-to-speech processing of the message) and/or using visible output of the message (e.g., via a display screen of the conferencing device or some other device that is accessible to the user).
At block 15.8101, the process performs determining the first language. In some embodiments, the process may determine or identify the first language, possibly prior to performing language translation. For example, the process may determine that the one speaker is speaking in German, so that it can configure a speech recognizer to recognize German language utterances.
At block 15.8201, the process performs concurrently processing the received data with multiple speech recognizers that are each configured to recognize speech in a different corresponding language. For example, the process may utilize speech recognizers for German, French, English, Chinese, Spanish, and the like, to attempt to recognize the speaker's utterance.
At block 15.8202, the process performs selecting as the first language the language corresponding to a speech recognizer of the multiple speech recognizers that produces a result that has a higher confidence level than others of the multiple speech recognizers. Typically, a speech recognizer may provide a confidence level corresponding with each recognition result. The process can exploit this confidence level to determine the most likely language being spoken by the one speaker, such as by taking the result with the highest confidence level, if one exists.
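The following sketch illustrates how such a selection could be coded, assuming hypothetical recognizer objects that expose a recognize(audio) method returning a (text, confidence) pair; the interface and the use of a thread pool are assumptions for illustration only.

```python
# Sketch: run one recognizer per candidate language over the same audio and
# pick the language whose recognizer reports the highest confidence.
from concurrent.futures import ThreadPoolExecutor

def identify_language(audio, recognizers):
    """recognizers: dict mapping language code -> recognizer whose
    recognize(audio) method returns (text, confidence in [0, 1])."""
    def run(item):
        lang, rec = item
        _, confidence = rec.recognize(audio)
        return lang, confidence

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run, recognizers.items()))
    best_lang, best_conf = max(results, key=lambda r: r[1])
    return best_lang, best_conf
```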
At block 15.8301, the process performs identifying signal characteristics in the received data that are correlated with the first language. In some embodiments, the process may exploit signal properties or characteristics that are highly correlated with particular languages. For example, spoken German may include phonemes that are unique to or at least more common in German than in other languages.
At block 15.8401, the process performs receiving an indication of a current location of the user. The current location may be based on a GPS coordinate provided by the conferencing device or some other device. The current location may be determined based on other context information, such as a network identifier, travel documents, or the like.
At block 15.8402, the process performs determining one or more languages that are commonly spoken at the current location. The process may reference a knowledge base or other information that associates locations with common languages.
At block 15.8403, the process performs selecting one of the one or more languages as the first language.
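A minimal sketch of blocks 15.8401 through 15.8403 follows; the country-code lookup table and the language codes are assumptions introduced purely for illustration.

```python
# Sketch: map a reported location to languages commonly spoken there and
# choose a default first language. The table below is illustrative only.
COMMON_LANGUAGES = {
    "DE": ["de"],
    "CH": ["de", "fr", "it"],
    "US": ["en", "es"],
}

def languages_for_location(country_code):
    return COMMON_LANGUAGES.get(country_code, ["en"])

def select_first_language(country_code, preferred=None):
    candidates = languages_for_location(country_code)
    if preferred in candidates:
        return preferred
    return candidates[0]

print(select_first_language("CH", preferred="fr"))  # fr
```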
At block 15.8501, the process performs presenting indications of multiple languages to the user. In some embodiments, the process may ask the user to choose the language of the one speaker. For example, the process may not be able to determine the language itself, or the process may have determined multiple equally likely candidate languages. In such circumstances, the process may prompt or otherwise request that the user indicate the language of the one speaker.
At block 15.8502, the process performs receiving from the user an indication of one of the multiple languages. The user may identify the language in various ways, such as via a spoken command, a gesture, a user interface input, or the like.
At block 15.8601, the process performs selecting a speech recognizer configured to recognize speech in the first language. Once the process has determined the language of the one speaker, it may select or configure a speech recognizer or other component (e.g., machine translation engine) to process the first language.
At block 15.8701, the process performs performing speech recognition, based on the speaker-related information, on the data representing the speech signal to convert the utterance in the first language into text representing the utterance in the first language. The speech recognition process may be improved, augmented, or otherwise adapted based on the speaker-related information. In one example, information about vocabulary frequently used by the one speaker may be used to improve the performance of a speech recognizer.
At block 15.8702, the process performs translating, based on the speaker-related information, the text representing the utterance in the first language into text representing the message in the second language. Translating from a first to a second language may also be improved, augmented, or otherwise adapted based on the speaker-related information. For example, when such a translation includes natural language processing to determine syntactic or semantic information about an utterance, such natural language processing may be improved with information about the one speaker, such as idioms, expressions, or other language constructs frequently employed or otherwise correlated with the one speaker.
At block 15.8801, the process performs performing speech synthesis to convert the text representing the utterance in the second language into audio data representing the message in the second language.
At block 15.8802, the process performs causing the audio data representing the message in the second language to be played to the user. The message may be played, for example, via an audio speaker of the conferencing device.
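The following sketch strings blocks 15.8701 through 15.8802 into a single recognize-translate-synthesize pipeline; the recognizer, translator, and synthesizer objects and their method signatures are hypothetical placeholders for whatever engines a given embodiment actually uses, and speaker_info stands in for speaker-related information used to adapt each stage.

```python
# Sketch of the recognize -> translate -> synthesize pipeline described above.
def translate_utterance(audio, first_lang, second_lang,
                        recognizer, translator, synthesizer, speaker_info=None):
    # 1. Speech recognition in the speaker's language, optionally adapted
    #    with a speaker-specific vocabulary (hypothetical attribute).
    text_src = recognizer.recognize(
        audio, language=first_lang,
        vocabulary=getattr(speaker_info, "vocabulary", None))
    # 2. Machine translation into the listener's language, with the
    #    speaker-related information supplied as disambiguating context.
    text_dst = translator.translate(text_src, source=first_lang,
                                    target=second_lang, context=speaker_info)
    # 3. Speech synthesis of the translated message for audible playback.
    audio_dst = synthesizer.synthesize(text_dst, language=second_lang)
    return text_dst, audio_dst
```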
At block 15.8901, the process performs performing speech recognition based on cepstral coefficients that represent the speech signal. In other embodiments, other types of features or information may also or instead be used to perform speech recognition, including language models, dialect models, or the like.
At block 15.9001, the process performs performing hidden Markov model-based speech recognition. Other approaches or techniques for speech recognition may include neural networks, stochastic modeling, or the like.
At block 15.9101, the process performs translating the utterance based on speaker-related information including an identity of the one speaker. The identity of the one speaker may be used in various ways, such as to determine a speaker-specific vocabulary to use during speech recognition, natural language processing, machine translation, or the like.
At block 15.9201, the process performs translating the utterance based on speaker-related information including a language model that is specific to the one speaker. A speaker-specific language model may include or otherwise identify frequent words or patterns of words (e.g., n-grams) based on prior communications or other information about the one speaker. Such a language model may be based on communications or other information generated by or about the one speaker. Such a language model may be employed in the course of speech recognition, natural language processing, machine translation, or the like. Note that the language model need not be unique to the one speaker, but may instead be specific to a class, type, or group of speakers that includes the one speaker. For example, the language model may be tailored for speakers in a particular industry, from a particular region, or the like.
At block 15.9301, the process performs translating the utterance based on a language model that is tailored to a group of people of which the one speaker is a member. As noted, the language model need not be unique to the one speaker. In some embodiments, the language model may be tuned to particular social classes, ethnic groups, countries, languages, or the like with which the one speaker may be associated.
At block 15.9401, the process performs generating the language model based on information items generated by the one speaker, the information items including at least one of emails transmitted by the one speaker, documents authored by the one speaker, and/or social network messages transmitted by the one speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, social network messages, and the like to generate a language model that is specific or otherwise tailored to the one speaker.
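As a minimal sketch of generating such a model, the following fragment builds simple bigram counts from text items authored by the one speaker; a practical embodiment would use smoothing, normalization, and far larger corpora, and the tokenizer here is an assumption.

```python
# Sketch: build a bigram count model from the one speaker's emails,
# documents, and social network messages.
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_bigram_model(information_items):
    counts = Counter()
    for text in information_items:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts

def bigram_score(counts, phrase):
    """Relative score of a candidate phrase under the speaker's model."""
    tokens = tokenize(phrase)
    return sum(counts[b] for b in zip(tokens, tokens[1:]))

model = build_bigram_model(["Please review the quarterly budget figures.",
                            "The budget figures look fine to me."])
print(bigram_score(model, "budget figures"))  # 2
```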
At block 15.9501, the process performs translating the utterance based on speaker-related information including a language model tailored to the voice conference. A language model tailored to the voice conference may include or otherwise identify frequent words or patterns of words (e.g., n-grams) based on prior communications or other information about any one or more of the speakers in the voice conference. Such a language model may be based on communications or other information generated by or about the speakers in the voice conference. Such a language model may be employed in the course of speech recognition, natural language processing, machine translation, or the like.
At block 15.9601, the process performs generating the language model based on information items by or about any of the multiple speakers, the information items including at least one of emails, documents, and/or social network messages. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, social network messages, and the like to generate a language model that is tailored to the voice conference.
At block 15.9701, the process performs translating the utterance based on speaker-related information including a speech model that is tailored to the one speaker. A speech model tailored to the one speaker (e.g., representing properties of the speech signal of the user) may be used to adapt or improve the performance of a speech recognizer. Note that the speech model need not be unique to the one speaker, but may instead be specific to a class, type, or group of speakers that includes the one speaker. For example, the speech model may be tailored for male speakers, female speakers, speakers from a particular country or region (e.g., to account for accents), or the like.
At block 15.9801, the process performs translating the utterance based on a speech model that is tailored to a group of people of which the one speaker is a member. As noted, the speech model need not be unique to the one speaker. In some embodiments, the speech model may be tuned to particular genders, social classes, ethnic groups, countries, languages, or the like with which the one speaker may be associated.
At block 15.9901, the process performs translating the utterance based on speaker-related information including an information item that references the one speaker. The information item may include a document, a message, a calendar event, a social networking relation, or the like. Various forms of information items are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, an information item may include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 15.10001, the process performs translating the utterance based on speaker-related information including at least one of a document that references the one speaker, a message that references the one speaker, a calendar event that references the one speaker, an indication of gender of the one speaker, and/or an organization to which the one speaker belongs. A document may be, for example, a report authored by the one speaker. A message may be an email, text message, social network status update, or other communication that is sent by the one speaker, sent to the one speaker, or references the one speaker in some other way. A calendar event may represent a past or future event to which the one speaker was invited. An event may be any occurrence that involves or involved the user and/or the one speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the one speaker, an upcoming deadline (e.g., for a project), or the like. Information about the gender of the one speaker may be used to customize or otherwise adapt a speech or language model that may be used during machine translation. The process may exploit an understanding of an organization to which the one speaker belongs when performing natural language processing on the utterance. For example, the identity of a company that employs the one speaker can be used to determine the meaning of industry-specific vocabulary in the utterance of the one speaker. The organization may include a business, company (e.g., profit or non-profit), group, school, club, team, or other formal or informal organization with which the one speaker is affiliated.
At block 15.10101, the process performs recording history information about the voice conference. In some embodiments, the process may record the voice conference and related information, so that such information can be played back at a later time, such as for reference purposes, for a participant who joins the conference late, or the like.
At block 15.10102, the process performs presenting the history information about the voice conference. Presenting the history information may include playing back audio, displaying a transcript, presenting indications of topics of conversation, or the like.
At block 15.10201, the process performs presenting the history information to a new participant in the voice conference, the new participant having joined the voice conference while the voice conference was already in progress. In some embodiments, the process may play back history information to a late arrival to the voice conference, so that the new participant may catch up with the conversation without needing to interrupt the proceedings.
At block 15.10301, the process performs presenting the history information to a participant in the voice conference, the participant having rejoined the voice conference after having left the voice conference for a period of time. In some embodiments, the process may play back history information to a participant who leaves and then rejoins the conference, for example when a participant temporarily leaves to visit the restroom, obtain some food, or attend to some other matter.
At block 15.10401, the process performs presenting at least one of a transcription of utterances made by speakers during the voice conference, indications of topics discussed during the voice conference, and/or indications of information items related to subject matter of the voice conference. The process may present various types of information about the voice conference, including a transcription (e.g., text of what was said and by whom), topics discussed (e.g., based on terms frequently used by speakers during the conference), relevant information items (e.g., emails, documents, plans, agreements mentioned by one or more speakers), or the like.
At block 15.10501, the process performs recording the data representing speech signals from the voice conference. The process may record speech, and then use such recordings for later playback, as a source for transcription, or for other purposes.
At block 15.10601, the process performs recording a transcription of utterances made by speakers during the voice conference. If the process performs speech recognition as discussed herein, it may record the results of such speech recognition as a transcription of the voice conference.
At block 15.10701, the process performs recording indications of topics discussed during the voice conference. Topics of conversation may be identified in various ways. For example, the process may track entities or terms that are commonly mentioned during the course of the voice conference. As another example, the process may attempt to identify agenda items which are typically discussed early in the voice conference. The process may also or instead refer to messages or other information items that are related to the voice conference, such as by analyzing email headers (e.g., subject lines) of email messages sent between participants in the voice conference.
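One minimal, illustrative way to surface frequently mentioned terms as candidate topics is sketched below; the stop-word list, the minimum token length, and the transcript shape are assumptions rather than elements of any described embodiment.

```python
# Sketch: surface likely topics by counting content terms across the
# transcript of the voice conference.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "we",
              "i", "you", "it", "that", "this", "on", "for", "be", "are"}

def topics_from_transcript(utterances, top_n=5):
    counts = Counter()
    for _speaker, text in utterances:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS and len(token) > 2:
                counts[token] += 1
    return [term for term, _ in counts.most_common(top_n)]

transcript = [("Alice", "We need to finalize the budget before the deadline."),
              ("Bob", "The budget depends on the vendor contract.")]
print(topics_from_transcript(transcript))  # 'budget' ranks first
```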
At block 15.10801, the process performs recording indications of information items related to subject matter of the voice conference. The process may track information items that are mentioned during the voice conference or otherwise related to participants in the voice conference, such as emails sent between participants in the voice conference.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 13.100. In addition, the computing system 16.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 13.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 16.400 comprises a computer memory (“memory”) 16.401, a display 16.402, one or more Central Processing Units (“CPU”) 16.403, Input/Output devices 16.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 16.405, and network connections 16.406. The AEFS 13.100 is shown residing in memory 16.401. In other embodiments, some portion of the contents, some or all of the components of the AEFS 13.100 may be stored on and/or transmitted over the other computer-readable media 16.405. The components of the AEFS 13.100 preferably execute on one or more CPUs 16.403 and facilitate ability enhancement, as described herein. Other code or programs 16.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 16.420, also reside in the memory 16.401, and preferably execute on one or more CPUs 16.403. Of note, one or more of the components in
The AEFS 13.100 interacts via the network 16.450 with conferencing devices 13.120, speaker-related information sources 13.130, and third-party systems/applications 16.455. The network 16.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 16.455 may include any systems that provide data to, or utilize data from, the AEFS 13.100, including Web browsers, e-commerce sites, calendar applications, email systems, social networking services, and the like.
The AEFS 13.100 is shown executing in the memory 16.401 of the computing system 16.400. Also included in the memory are a user interface manager 16.415 and an application program interface (“API”) 16.416. The user interface manager 16.415 and the API 16.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 13.100.
The UI manager 16.415 provides a view and a controller that facilitate user interaction with the AEFS 13.100 and its various components. For example, the UI manager 16.415 may provide interactive access to the AEFS 13.100, such that users can configure the operation of the AEFS 13.100, such as by providing the AEFS 13.100 credentials to access various sources of speaker-related information, including social networking services, email systems, document stores, or the like. In some embodiments, access to the functionality of the UI manager 16.415 may be provided via a Web server, possibly executing as one of the other programs 16.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 16.455 can interact with the AEFS 13.100 via the UI manager 16.415.
The API 16.416 provides programmatic access to one or more functions of the AEFS 13.100. For example, the API 16.416 may provide a programmatic interface to one or more functions of the AEFS 13.100 that may be invoked by one of the other programs 16.430 or some other module. In this manner, the API 16.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 13.100 into Web applications), and the like.
In addition, the API 16.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the conferencing devices 13.120, information sources 13.130, and/or one of the third-party systems/applications 16.455, to access various functions of the AEFS 13.100. For example, an information source 13.130 may push speaker-related information (e.g., emails, documents, calendar events) to the AEFS 13.100 via the API 16.416. The API 16.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 16.455 and that are configured to interact with the AEFS 13.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 13.100 are implemented using standard programming techniques. For example, the AEFS 13.100 may be implemented as a “native” executable running on the CPU 16.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 13.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 16.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 13.100, such as in the data store 16.420 (or 14.240), can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through scripting languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 16.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner including but not limited to TCP/IP sockets, RPC, RMI, HTTP, Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 13.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, for enhancing a user's ability to operate or function in a transportation-related context (e.g., as a pedestrian or vehicle operator) by performing vehicular threat detection based at least in part on analyzing audio signals emitted by other vehicles present in a roadway or other context. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). Embodiments of the AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities (e.g., driving, riding a bike, walking/running) of a user.
In some embodiments, the AEFS is configured to identify threats posed by vehicles to a user of a roadway, and to provide information about such threats to the user so that he may take evasive action. Identifying threats may include analyzing audio data, such as sounds emitted by a vehicle in order to determine whether the user and the vehicle may be on a collision course. Other types and sources of data may also or instead be utilized, including video data, range information, conditions information (e.g., weather, temperature, time of day), or the like. The user may be a pedestrian (e.g., a walker, a jogger), an operator of a motorized (e.g., car, motorcycle, moped, scooter) or non-motorized vehicle (e.g., bicycle, pedicab, rickshaw), a vehicle passenger, or the like. In some embodiments, the user wears a wearable device (e.g., a helmet, goggles, eyeglasses, hat) that is configured to at least present determined vehicular threat information to the user.
In some embodiments, the AEFS is configured to receive data representing an audio signal emitted by a first vehicle. The audio signal is typically obtained in proximity to a user, who may be a pedestrian or traveling in a vehicle as an operator or a passenger. In some embodiments, the audio signal is obtained by one or more microphones coupled to the user's vehicle and/or a wearable device of the user, such as a helmet, goggles, a hat, a media player, or the like.
Then, the AEFS determines vehicular threat information based at least in part on the data representing the audio signal. In some embodiments, the AEFS may analyze the received data in order to determine whether the first vehicle represents a threat to the user, such as because the first vehicle and the user may be on a collision course. The audio data may be analyzed in various ways, including by performing audio analysis, frequency analysis (e.g., Doppler analysis), acoustic localization, or the like. Other sources of information may also or instead be used, including information received from the first vehicle, a vehicle of the user, other vehicles, in-situ sensors and devices (e.g., traffic cameras, range sensors, induction coils), traffic information systems, weather information systems, and the like.
Next, the AEFS informs the user of the determined vehicular threat information via a wearable device of the user. Typically, the user's wearable device (e.g., a helmet) will include one or more output devices, such as audio speakers, visual display devices (e.g., warning lights, screens, heads-up displays), haptic devices, and the like. The AEFS may present the vehicular threat information via one or more of these output devices. For example, the AEFS may visually display or speak the words “Car on left.” As another example, the AEFS may visually display a leftward pointing arrow on a heads-up screen displayed on a face screen of the user's helmet. Presenting the vehicular threat information may also or instead include presenting a recommended course of action (e.g., to slow down, to speed up, to turn) to mitigate the determined vehicular threat.
In this example, the moped 17.110a is driving towards the motorcycle 17.110b from a side street, at approximately a right angle with respect to the path of travel of the motorcycle 17.110b. The traffic signal 17.106 has just turned from red to green for the motorcycle 17.110b, and the user 17.104 is beginning to drive the motorcycle 17.110b into the intersection controlled by the traffic signal 17.106. The user 17.104 is assuming that the moped 17.110a will stop, because cross traffic will have a red light. However, in this example, the moped 17.110a may not stop in a timely manner, for one or more reasons, such as because the operator of the moped 17.110a has not seen the red light, because the moped 17.110a is moving at an excessive rate, because the operator of the moped 17.110a is impaired, because the surface conditions of the roadway are icy or slick, or the like. As will be discussed further below, the AEFS 17.100 will determine that the moped 17.110a and the motorcycle 17.110b are likely on a collision course, and inform the user 17.104 of this threat via the helmet 17.120a, so that the user may take evasive action to avoid a possible collision with the moped 17.110a.
The moped 17.110a emits an audio signal 17.101 (e.g., a sound wave emitted from its engine), which travels in advance of the moped 17.110a. The audio signal 17.101 is received by a microphone (not shown) on the helmet 17.120a and/or the motorcycle 17.110b. In some embodiments, a computing and communication device within the helmet 17.120a samples the audio signal 17.101 and transmits the samples to the AEFS 17.100. In other embodiments, other forms of data may be used to represent the audio signal 17.101, including frequency coefficients, compressed audio, or the like.
The AEFS 17.100 determines vehicular threat information by analyzing the received data that represents the audio signal 17.101. The AEFS 17.100 may use one or more audio analysis techniques to determine the vehicular threat information. In one embodiment, the AEFS 17.100 performs a Doppler analysis (e.g., by determining whether the frequency of the audio signal is increasing or decreasing) to determine whether, and possibly at what rate, the object that is emitting the audio signal is approaching the user 17.104. In some embodiments, the AEFS 17.100 may determine the type of vehicle (e.g., a heavy truck, a passenger vehicle, a motorcycle, a moped) by analyzing the received data to identify an audio signature that is correlated with a particular engine type or size. For example, a lower frequency engine sound may be correlated with a larger vehicle size, and a higher frequency engine sound may be correlated with a smaller vehicle size.
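A minimal sketch of one such Doppler-style analysis follows: it tracks the dominant frequency across successive windows of the received audio and treats a rising trend as an approaching source and a falling trend as a receding one. The window size and the linear-fit trend test are illustrative assumptions, not the disclosed analysis.

```python
# Sketch: approach/recede classification from the trend of the dominant
# engine frequency over successive audio windows.
import numpy as np

def dominant_frequency(window, sample_rate):
    window = np.asarray(window, dtype=float)
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def approach_trend(samples, sample_rate, window_size=4096):
    n = len(samples) // window_size
    if n < 2:
        return "unknown"  # not enough audio to estimate a trend
    peaks = [dominant_frequency(samples[i * window_size:(i + 1) * window_size],
                                sample_rate) for i in range(n)]
    slope = np.polyfit(np.arange(n), peaks, 1)[0]  # Hz per window
    if slope > 0:
        return "approaching"   # perceived pitch rising
    if slope < 0:
        return "receding"      # perceived pitch falling
    return "steady"
```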
In one embodiment, the AEFS 17.100 performs acoustic source localization to determine information about the trajectory of the moped 17.110a, including one or more of position, direction of travel, speed, acceleration, or the like. Acoustic source localization may include receiving data representing the audio signal 17.101 as measured by two or more microphones. For example, the helmet 17.120a may include four microphones (e.g., front, right, rear, and left) that each receive the audio signal 17.101. These microphones may be directional, such that they can be used to provide directional information (e.g., an angle between the helmet and the audio source). Such directional information may then be used by the AEFS 17.100 to triangulate the position of the moped 17.110a. As another example, the AEFS 17.100 may measure differences between the arrival time of the audio signal 17.101 at multiple distinct microphones on the helmet 17.120a or other location. The difference in arrival time, together with information about the distance between the microphones, can be used by the AEFS 17.100 to determine distances between each of the microphones and the audio source, such as the moped 17.110a. Distances between the microphones and the audio source can then be used to determine one or more locations at which the audio source may be located.
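The far-field geometry underlying such localization can be sketched for a two-microphone case as follows: the cross-correlation peak between the two channels gives the arrival-time difference tau, and the bearing is theta = arccos(c * tau / d) for microphone spacing d and speed of sound c. The spacing and the 343 m/s speed of sound below are assumed values for illustration.

```python
# Sketch: two-microphone time-difference-of-arrival bearing estimate.
import numpy as np

def bearing_from_tdoa(left, right, sample_rate, mic_spacing_m=0.2,
                      speed_of_sound=343.0):
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # delay in samples; sign gives side
    tau = lag / sample_rate                     # arrival-time difference (s)
    # Far-field model: tau = spacing * cos(theta) / c
    cos_theta = np.clip(speed_of_sound * tau / mic_spacing_m, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))     # 0 deg = along the mic axis
```

Combining bearings from two or more such microphone pairs at known positions allows the source position to be triangulated, as described above.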
Determining vehicular threat information may also include obtaining information such as the position, trajectory, and speed of the user 17.104, such as by receiving data representing such information from sensors, devices, and/or systems on board the motorcycle 17.110b and/or the helmet 17.120a. Such sources of information may include a speedometer, a geo-location system (e.g., GPS system), an accelerometer, or the like. Once the AEFS 17.100 has determined and/or obtained information such as the position, trajectory, and speed of the moped 17.110a and the user 17.104, the AEFS 17.100 may determine whether the moped 17.110a and the user 17.104 are likely to collide with one another. For example, the AEFS 17.100 may model the expected trajectories of the moped 17.110a and user 17.104 to determine whether they intersect at or about the same point in time.
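A minimal sketch of one such collision-course test, using a constant-velocity closest-point-of-approach computation, is shown below; the coordinates and any safety threshold applied to the resulting distance are illustrative only.

```python
# Sketch: find the time and distance of closest approach between two
# constant-velocity objects; a small distance indicates a collision course.
import numpy as np

def closest_approach(p1, v1, p2, v2):
    """p*, v* are 2-D position (m) and velocity (m/s) vectors."""
    r = np.asarray(p2, float) - np.asarray(p1, float)   # relative position
    v = np.asarray(v2, float) - np.asarray(v1, float)   # relative velocity
    if np.dot(v, v) < 1e-9:
        return 0.0, float(np.linalg.norm(r))            # separation is constant
    t_star = max(0.0, -np.dot(r, v) / np.dot(v, v))     # time of closest approach
    d_star = float(np.linalg.norm(r + v * t_star))      # distance at that time
    return t_star, d_star

# Motorcycle heading north; moped approaching from the east side street.
t, d = closest_approach(p1=(0, 0), v1=(0, 10), p2=(30, 30), v2=(-10, 0))
print(round(t, 1), round(d, 1))  # 3.0 0.0 -> on a collision course
```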
The AEFS 17.100 may then present the determined vehicular threat information (e.g., that the moped 17.110a represents a hazard) to the user 17.104 via the helmet 17.120a. Presenting the vehicular threat information may include transmitting the information to the helmet 17.120a, where it is received and presented to the user. In one embodiment, the helmet 17.120a includes audio speakers that may be used to output an audio signal (e.g., an alarm or voice message) warning the user 17.104. In other embodiments, the helmet 17.120a includes a visual display, such as a heads-up display presented upon a face screen of the helmet 17.120a, which can be used to present a text message (e.g., “Look left”) or an icon (e.g., a red arrow pointing left).
The AEFS 17.100 may also use information received from in-situ sensors and/or devices. For example, the AEFS 17.100 may use information received from a camera 17.108 that is mounted on the traffic signal 17.106 that controls the illustrated intersection. The AEFS 17.100 may receive image data that represents the moped 17.110a and/or the motorcycle 17.110b. The AEFS 17.100 may perform image recognition to determine the type and/or position of a vehicle that is approaching the intersection. The AEFS 17.100 may also or instead analyze multiple images (e.g., from a video signal) to determine the velocity of a vehicle. Other types of sensors or devices installed in or about a roadway may also or instead be used, including range sensors, speed sensors (e.g., radar guns), induction coils (e.g., mounted in the roadbed), temperature sensors, weather gauges, or the like.
As noted above, the AEFS 17.100 may utilize data that represents an audio signal as detected by multiple different microphones. In the example of
The AEFS 17.100 may model vehicles and other objects, such as by representing their positions, speeds, accelerations, and other information. Such a model may then be used to determine whether objects are likely to collide. Note that the model may be probabilistic. For example, the AEFS 17.100 may represent an object's position in space as a region that includes multiple positions that each have a corresponding likelihood that the object is at that position. As another example, the AEFS 17.100 may represent the velocity of an object as a range of likely values, a probability distribution, or the like.
The AEFS 17.100 may interact with various types of wearable devices 17.120, including a motorcycle helmet 17.120a (
In some embodiments, a wearable device may perform some or all of the functions of the AEFS 17.100, even though the AEFS 17.100 is depicted as separate in these examples. Some devices may have minimal processing power and thus perform only some of the functions. For example, the eyeglasses 17.120b may receive vehicular threat information from a remote AEFS 17.100, and display it on a heads-up display presented on the inside of the lenses of the eyeglasses 17.120b. Other wearable devices may have sufficient processing power to perform more of the functions of the AEFS 17.100. For example, the personal media device 17.120e may have considerable processing power and as such be configured to perform acoustic source localization, collision detection analysis, or other more computationally expensive functions.
Note that the wearable devices 17.120 may act in concert with one another or with other entities to perform functions of the AEFS 17.100. For example, the eyeglasses 17.120b may include a display mechanism that receives and displays vehicular threat information determined by the personal media device 17.120e. As another example, the goggles 17.120c may include a display mechanism that receives and displays vehicular threat information determined by a computing device in the helmet 17.120a or 17.120d. In a further example, one of the wearable devices 17.120 may receive and process audio data received by microphones mounted on the vehicle 17.110c.
The AEFS 17.100 may also or instead interact with vehicles 17.110 and/or computing devices installed thereon. As noted, a vehicle 17.110 may have one or more sensors or devices that may operate as (direct or indirect) sources of information for the AEFS 17.100. The vehicle 17.110c, for example, may include a speedometer, an accelerometer, one or more microphones, one or more range sensors, or the like. Data obtained by, at, or from such devices of vehicle 17.110c may be forwarded to the AEFS 17.100, possibly by a wearable device 17.120 of an operator of the vehicle 17.110c.
In some embodiments, the vehicle 17.110c may itself have or use an AEFS, and be configured to transmit warnings or other vehicular threat information to others. For example, an AEFS of the vehicle 17.110c may have determined that the moped 17.110a was driving with excessive speed just prior to the scenario depicted in
The AEFS 17.100 may also or instead interact with sensors and other devices that are installed on, in, or about roads or in other transportation related contexts, such as parking garages, racetracks, or the like. In this example, the AEFS 17.100 interacts with the camera 17.108 to obtain images of vehicles, pedestrians, or other objects present in a roadway. Other types of sensors or devices may include range sensors, infrared sensors, induction coils, radar guns, temperature gauges, precipitation gauges, or the like.
The AEFS 17.100 may further interact with information systems that are not shown in
Note that in some embodiments, at least some of the described techniques may be performed without the utilization of any wearable devices 17.120. For example, a vehicle 17.110 may itself include the necessary computation, input, and output devices to perform functions of the AEFS 17.100. For example, the AEFS 17.100 may present vehicular threat information on output devices of a vehicle 17.110, such as a radio speaker, dashboard warning light, heads-up display, or the like. As another example, a computing device on a vehicle 17.110 may itself determine the vehicular threat information.
The threat analysis engine 18.210 includes an audio processor 18.212, an image processor 18.214, other sensor data processors 18.216, and an object tracker 18.218. In the illustrated example, the audio processor 18.212 processes audio data received from the wearable device 17.120. As noted, such data may be received from other sources as well or instead, including directly from a vehicle-mounted microphone, or the like. The audio processor 18.212 may perform various types of signal processing, including audio level analysis, frequency analysis, acoustic source localization, or the like. Based on such signal processing, the audio processor 18.212 may determine strength, direction of audio signals, audio source distance, audio source type, or the like. Outputs of the audio processor 18.212 (e.g., that an object is approaching from a particular angle) may be provided to the object tracker 18.218 and/or stored in the data store 18.240.
The image processor 18.214 receives and processes image data that may be received from sources such as the wearable device 17.120 and/or information sources 17.130. For example, the image processor 18.214 may receive image data from a camera of the wearable device 17.120, and perform object recognition to determine the type and/or position of a vehicle that is approaching the user 17.104. As another example, the image processor 18.214 may receive a video signal (e.g., a sequence of images) and process them to determine the type, position, and/or velocity of a vehicle that is approaching the user 17.104. Outputs of the image processor 18.214 (e.g., position and velocity information, vehicle type information) may be provided to the object tracker 18.218 and/or stored in the data store 18.240.
The other sensor data processor 18.216 receives and processes data received from other sensors or sources. For example, the other sensor data processor 18.216 may receive and/or determine information about the position and/or movements of the user and/or one or more vehicles, such as based on GPS systems, speedometers, accelerometers, or other devices. As another example, the other sensor data processor 18.216 may receive and process conditions information (e.g., temperature, precipitation) from the information sources 17.130 and determine that road conditions are currently icy. Outputs of the other sensor data processor 18.216 (e.g., that the user is moving at 5 miles per hour) may be provided to the object tracker 18.218 and/or stored in the data store 18.240.
The object tracker 18.218 manages a geospatial object model that includes information about objects known to the AEFS 17.100. The object tracker 18.218 receives and merges information about object types, positions, velocity, acceleration, direction of travel, and the like, from one or more of the processors 18.212, 18.214, 18.216, and/or other sources. Based on such information, the object tracker 18.218 may identify the presence of objects as well as their likely positions, paths, and the like. The object tracker 18.218 may continually update this model as new information becomes available and/or as time passes (e.g., by plotting a likely current position of an object based on its last measured position and trajectory). The object tracker 18.218 may also maintain confidence levels corresponding to elements of the geo-spatial model, such as a likelihood that a vehicle is at a particular position or moving at a particular velocity, that a particular object is a vehicle and not a pedestrian, or the like.
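The dead-reckoning update described above can be sketched minimally as follows; the field names, the confidence decay constant, and the measurement handling are assumptions introduced for illustration rather than the tracker's actual implementation.

```python
# Sketch: advance each tracked object along its last known velocity between
# measurements and decay confidence as the estimate ages.
from dataclasses import dataclass

@dataclass
class TrackedObject:
    object_id: str
    position: tuple      # (x, y) in meters
    velocity: tuple      # (vx, vy) in m/s
    confidence: float    # 0..1
    last_update: float   # seconds

def predict(obj, now, decay_per_second=0.05):
    """Dead-reckon the object's position to the current time."""
    dt = now - obj.last_update
    x, y = obj.position
    vx, vy = obj.velocity
    obj.position = (x + vx * dt, y + vy * dt)
    obj.confidence = max(0.0, obj.confidence - decay_per_second * dt)
    obj.last_update = now
    return obj

def observe(obj, now, position, velocity, confidence=0.9):
    """Replace the prediction with a fresh measurement."""
    obj.position, obj.velocity = position, velocity
    obj.confidence, obj.last_update = confidence, now
    return obj
```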
The agent logic 18.220 implements the core intelligence of the AEFS 17.100. The agent logic 18.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to determine vehicular threat information. For example, the agent logic 18.220 may combine information from the object tracker 18.218, such as that there is a determined likelihood of a collision at an intersection, with information from one of the information sources 17.130, such as that the intersection is the scene of common red-light violations, and decide that the likelihood of a collision is high enough to transmit a warning to the user 17.104. As another example, the agent logic 18.220 may, in the face of multiple distinct threats to the user, determine which threat is the most significant and cause the user to avoid the more significant threat, such as by not directing the user 17.104 to slam on the brakes when a bicycle is approaching from the side but a truck is approaching from the rear, because being rear-ended by the truck would have more serious consequences than being hit from the side by the bicycle.
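A minimal sketch of one way such prioritization could be expressed follows; the severity table, the scoring rule, and the shape of the threat records are assumptions for illustration, not the disclosed reasoning engine.

```python
# Sketch: rank concurrent threats by collision probability weighted by an
# estimated severity, so the most consequential threat is warned about first.
SEVERITY = {"truck": 1.0, "car": 0.7, "moped": 0.4, "bicycle": 0.2}

def rank_threats(threats):
    """threats: iterable of dicts with 'vehicle_type', 'collision_probability',
    and 'direction' keys (hypothetical record shape)."""
    def score(t):
        return t["collision_probability"] * SEVERITY.get(t["vehicle_type"], 0.5)
    return sorted(threats, key=score, reverse=True)

threats = [
    {"vehicle_type": "bicycle", "collision_probability": 0.6, "direction": "left"},
    {"vehicle_type": "truck", "collision_probability": 0.3, "direction": "rear"},
]
print(rank_threats(threats)[0]["vehicle_type"])  # truck
```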
The presentation engine 18.230 includes a visible output processor 18.232 and an audible output processor 18.234. The visible output processor 18.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the wearable device 17.120 or some other display (e.g., a heads-up display of a vehicle 17.110 being driven by the user 17.104). The agent logic 18.220 may use or invoke the visible output processor 18.232 to prepare and display information, such as by formatting or otherwise modifying vehicular threat information to fit on a particular type or size of display. The audible output processor 18.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 18.220 may use or invoke the audible output processor 18.234 in order to convert a textual message (e.g., a warning message, a threat identification) into audio output suitable for presentation via the wearable device 17.120, for example by employing a text-to-speech processor.
Note that one or more of the illustrated components/modules may not be present in some embodiments. For example, in embodiments that do not perform image or video processing, the AEFS 17.100 may not include an image processor 18.214. As another example, in embodiments that do not perform audio output, the AEFS 17.100 may not include an audible output processor 18.234.
Note also that the AEFS 17.100 may act in service of multiple users 17.104. In some embodiments, the AEFS 17.100 may determine vehicular threat information concurrently for multiple distinct users. Such embodiments may further facilitate the sharing of vehicular threat information. For example, vehicular threat information determined as between two vehicles may be relevant and thus shared with a third vehicle that is in proximity to the other two vehicles.
FIGS. 19.1-19.70 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 19.101, the process performs receiving data representing an audio signal obtained in proximity to a user, the audio signal emitted by a first vehicle. The data representing the audio signal may be raw audio samples, compressed audio data, frequency coefficients, or the like. The data representing the audio signal may represent the sound made by the first vehicle, such as from its engine, a horn, tires, or any other source of sound. The data representing the audio signal may include sounds from other sources, including other vehicles, pedestrians, or the like. The audio signal may be obtained at or about a user who is a pedestrian or who is in a vehicle that is not the first vehicle, either as the operator or a passenger.
At block 19.102, the process performs determining vehicular threat information based at least in part on the data representing the audio signal. Vehicular threat information may be determined in various ways, including by analyzing the data representing the audio signal to determine whether it indicates that the first vehicle is approaching the user. Analyzing the data may be based on various techniques, including analyzing audio levels, frequency shifts (e.g., the Doppler Effect), acoustic source localization, or the like.
At block 19.103, the process performs presenting the vehicular threat information via a wearable device of the user. The determined threat information may be presented in various ways, such as by presenting an audible or visible warning or other indication that the first vehicle is approaching the user. Different types of wearable devices are contemplated, including helmets, eyeglasses, goggles, hats, and the like. In other embodiments, the vehicular threat information may also or instead be presented in other ways, such as via an output device on a vehicle of the user, in-situ output devices (e.g., traffic signs, road-side speakers), or the like.
At block 19.201, the process performs receiving data obtained at a microphone array that includes multiple microphones. In some embodiments, a microphone array having two or more microphones is employed to receive audio signals. Differences between the received audio signals may be utilized to perform acoustic source localization or other functions, as discussed further herein.
At block 19.301, the process performs receiving data obtained at a microphone array, the microphone array coupled to a vehicle of the user. In some embodiments, such as when the user is operating or otherwise traveling in a vehicle of his own (that is not the same as the first vehicle), the microphone array may be coupled or attached to the user's vehicle, such as by having a microphone located at each of the four corners of the user's vehicle.
At block 19.401, the process performs receiving data obtained at a microphone array, the microphone array coupled to the wearable device. For example, if the wearable device is a helmet, then a first microphone may be located on the left side of the helmet while a second microphone may be located on the right side of the helmet.
At block 19.501, the process performs determining a position of the first vehicle. The position of the first vehicle may be expressed absolutely, such as via a GPS coordinate or similar representation, or relatively, such as with respect to the position of the user (e.g., 20 meters away from the user). In addition, the position of the first vehicle may be represented as a point or collection of points (e.g., a region, arc, or line).
At block 19.601, the process performs determining a velocity of the first vehicle. The process may determine the velocity of the first vehicle in absolute or relative terms (e.g., with respect to the velocity of the user). The velocity may be expressed or represented as a magnitude (e.g., 10 meters per second), a vector (e.g., having a magnitude and a direction), or the like.
At block 19.701, the process performs determining a direction of travel of the first vehicle. The process may determine a direction in which the first vehicle is traveling, such as with respect to the user and/or some absolute coordinate system.
At block 19.801, the process performs determining whether the first vehicle is approaching the user. Determining whether the first vehicle is approaching the user may include determining information about the movements of the user and the first vehicle, including position, direction of travel, velocity, acceleration, and the like. Based on such information, the process may determine whether the courses of the user and the first vehicle will (or are likely to) intersect one another.
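One way to estimate whether two courses will intersect is to compute the time and distance of closest approach under a straight-line motion assumption, as in the following illustrative Python sketch. The function name, 2-D coordinates, and example values are assumptions for the purpose of the example only.

    import numpy as np

    def closest_approach(p_user, v_user, p_vehicle, v_vehicle):
        """Return (time_of_closest_approach, miss_distance) for two straight-line tracks.

        Positions are in meters, velocities in meters per second (2-D)."""
        dp = np.asarray(p_vehicle, float) - np.asarray(p_user, float)
        dv = np.asarray(v_vehicle, float) - np.asarray(v_user, float)
        denom = float(np.dot(dv, dv))
        if denom < 1e-9:                      # no relative motion
            return 0.0, float(np.linalg.norm(dp))
        t = max(0.0, -float(np.dot(dp, dv)) / denom)
        miss = float(np.linalg.norm(dp + dv * t))
        return t, miss

    # A car 40 m to the east closing at 10 m/s passes within ~0 m in ~4 s.
    print(closest_approach((0, 0), (0, 0), (40, 0), (-10, 0)))  # (4.0, 0.0)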
At block 19.901, the process performs performing acoustic source localization to determine a position of the first vehicle based on multiple audio signals received via multiple microphones. The process may determine a position of the first vehicle by analyzing audio signals received via multiple distinct microphones. For example, engine noise of the first vehicle may have different characteristics (e.g., in volume, in time of arrival, in frequency) as received by different microphones. Differences between the audio signal measured at different microphones may be exploited to determine one or more positions (e.g., points, arcs, lines, regions) at which the first vehicle may be located.
At block 19.1001, the process performs receiving an audio signal via a first one of the multiple microphones, the audio signal representing a sound created by the first vehicle. In one approach, at least two microphones are employed. By measuring differences in the arrival time of an audio signal at the two microphones, the position of the first vehicle may be determined. The determined position may be a point, a line, an area, or the like.
At block 19.1002, the process performs receiving the audio signal via a second one of the multiple microphones.
At block 19.1003, the process performs determining the position of the first vehicle by determining a difference between an arrival time of the audio signal at the first microphone and an arrival time of the audio signal at the second microphone. In some embodiments, given information about the distance between the two microphones and the speed of sound, the process may determine the respective distances between each of the two microphones and the first vehicle. Given these two distances (along with the distance between the microphones), the process can solve for the one or more positions at which the first vehicle may be located.
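As a non-limiting example of turning an arrival-time difference into geometry, the sketch below applies the common far-field approximation, in which the source bearing relative to a two-microphone baseline satisfies sin(theta) = c*dt/d. The microphone spacing and time difference used in the example are invented values.

    import math

    def bearing_from_tdoa(delta_t, mic_spacing, speed_of_sound=343.0):
        """Estimate source bearing (degrees from broadside) from a time difference
        of arrival. delta_t is the arrival time at microphone A minus the arrival
        time at microphone B; positive values place the source on B's side."""
        ratio = speed_of_sound * delta_t / mic_spacing
        ratio = max(-1.0, min(1.0, ratio))   # clamp numerical noise
        return math.degrees(math.asin(ratio))

    # Sound arriving 1 ms earlier at microphone B of a 0.5 m pair places the
    # source roughly 43 degrees off broadside, on microphone B's side.
    print(round(bearing_from_tdoa(0.001, 0.5), 1))  # 43.3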
At block 19.1101, the process performs triangulating the position of the first vehicle based on a first and second angle, the first angle measured between a first one of the multiple microphones and the first vehicle, the second angle measured between a second one of the multiple microphones and the first vehicle. In some embodiments, the microphones may be directional, in that they may be used to determine the direction from which the sound is coming. Given such information, the process may use triangulation techniques to determine the position of the first vehicle.
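The following illustrative Python sketch shows one conventional triangulation computation from two bearings measured at the two ends of a known baseline. The coordinate conventions and example angles are assumptions for the purpose of the example.

    import math

    def triangulate(baseline, angle_a_deg, angle_b_deg):
        """Locate a source from two bearings measured at microphones A=(0, 0) and
        B=(baseline, 0). angle_a is measured at A from the A->B direction toward
        the source; angle_b is measured at B from the B->A direction."""
        a = math.radians(angle_a_deg)
        b = math.radians(angle_b_deg)
        denom = math.tan(a) + math.tan(b)
        if abs(denom) < 1e-9:
            raise ValueError("bearings are parallel; no unique intersection")
        x = baseline * math.tan(b) / denom
        y = x * math.tan(a)
        return x, y

    # Bearings of 45 degrees from each end of a 1 m baseline place the source
    # about 0.5 m along and 0.5 m out from the array.
    print(triangulate(1.0, 45.0, 45.0))  # approximately (0.5, 0.5)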
At block 19.1201, the process performs performing a Doppler analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the frequency of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. For example, if the frequency is shifting higher, the first vehicle may be determined to be approaching the user. Note that the determination is typically made from the frame of reference of the user (who may be moving or not). Thus, the first vehicle may be determined to be approaching the user when, as viewed from a fixed frame of reference, the user is approaching the first vehicle (e.g., a moving user traveling towards a stationary vehicle) or the first vehicle is approaching the user (e.g., a moving vehicle approaching a stationary user). In other embodiments, other frames of reference may be employed, such as a fixed frame, a frame associated with the first vehicle, or the like.
At block 19.1301, the process performs determining whether frequency of the audio signal is increasing or decreasing.
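A non-limiting way to implement such a frequency-trend check is to estimate the dominant frequency in successive audio windows and compare the first and last estimates, as in the Python sketch below. The window length, sample rate, and shift threshold are illustrative assumptions.

    import numpy as np

    def dominant_frequency(samples, sample_rate):
        """Return the strongest frequency component (Hz) of one audio window."""
        spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        return freqs[int(np.argmax(spectrum))]

    def is_approaching(windows, sample_rate, min_shift_hz=2.0):
        """Rough Doppler check: a rising dominant frequency across successive
        windows suggests the source is closing on the listener."""
        peaks = [dominant_frequency(w, sample_rate) for w in windows]
        return (peaks[-1] - peaks[0]) > min_shift_hz

    # Synthetic example: an engine tone sweeping upward from 200 Hz to 210 Hz.
    rate = 8000
    t = np.arange(rate) / rate
    windows = [np.sin(2 * np.pi * f * t) for f in (200, 205, 210)]
    print(is_approaching(windows, rate))  # True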
At block 19.1401, the process performs performing a volume analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the volume (e.g., amplitude) of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. An increasing volume may indicate that the first vehicle is approaching the user. As noted, different embodiments may use different frames of reference when making this determination.
At block 19.1501, the process performs determining whether volume of the audio signal is increasing or decreasing.
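Similarly, a volume trend may be approximated by comparing the root-mean-square amplitude of successive windows, as in the following illustrative sketch; the growth ratio and the synthetic signal are assumed values.

    import numpy as np

    def rms(samples):
        """Root-mean-square amplitude of one audio window."""
        samples = np.asarray(samples, dtype=float)
        return float(np.sqrt(np.mean(samples ** 2)))

    def volume_rising(windows, min_ratio=1.2):
        """A steadily rising RMS level across windows suggests an approaching source."""
        levels = [rms(w) for w in windows]
        return levels[-1] > levels[0] * min_ratio

    # Three windows of the same tone at increasing amplitude.
    t = np.linspace(0, 1, 8000)
    windows = [a * np.sin(2 * np.pi * 100 * t) for a in (0.2, 0.4, 0.8)]
    print(volume_rising(windows))  # True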
At block 19.1601, the process performs determining the vehicular threat information based on gaze information associated with the user. In some embodiments, the process may consider the direction in which the user is looking when determining the vehicular threat information. For example, the vehicular threat information may depend on whether the user is or is not looking at the first vehicle, as discussed further below.
At block 19.1701, the process performs receiving an indication of a direction in which the user is looking. In some embodiments, an orientation sensor such as a gyroscope or accelerometer may be employed to determine the orientation of the user's head, face, or other body part. In some embodiments, a camera or other image sensing device may track the orientation of the user's eyes.
At block 19.1702, the process performs determining that the user is not looking towards the first vehicle. As noted, the process may track the position of the first vehicle. Given this information, coupled with information about the direction of the user's gaze, the process may determine whether or not the user is (or likely is) looking in the direction of the first vehicle.
At block 19.1703, the process performs in response to determining that the user is not looking towards the first vehicle, directing the user to look towards the first vehicle. When it is determined that the user is not looking at the first vehicle, the process may warn or otherwise direct the user to look in that direction, such as by saying or otherwise presenting “Look right!”, “Car on your left,” or similar message.
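The sketch below illustrates one possible comparison between the user's gaze bearing and the bearing toward the tracked vehicle, producing a "Look left!"/"Look right!" style prompt. The coordinate convention (degrees counter-clockwise from east) and the tolerance are assumptions for the example.

    import math

    def look_prompt(user_pos, gaze_deg, vehicle_pos, tolerance_deg=30.0):
        """Return a prompt if the user's gaze bearing (degrees, 0 = east,
        counter-clockwise positive) is not roughly toward the tracked vehicle."""
        dx = vehicle_pos[0] - user_pos[0]
        dy = vehicle_pos[1] - user_pos[1]
        target = math.degrees(math.atan2(dy, dx))
        diff = (target - gaze_deg + 180) % 360 - 180   # signed smallest angle
        if abs(diff) <= tolerance_deg:
            return None                                # already looking that way
        return "Look left!" if diff > 0 else "Look right!"

    # User gazing due east (0 degrees) while a vehicle approaches from the north.
    print(look_prompt((0, 0), 0.0, (0, 50)))  # Look left!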
At block 19.1801, the process performs identifying multiple threats to the user. The process may in some cases identify multiple potential threats, such as one car approaching the user from behind and another car approaching the user from the left. In some cases, one or more of the multiple threats may themselves arise if or when the user takes evasive action to avoid some other threat. For example, the process may determine that a bus traveling behind the user will become a threat if the user responds to a bike approaching from his side by slamming on the brakes.
At block 19.1802, the process performs identifying a first one of the multiple threats that is more significant than at least one other of the multiple threats. The process may rank, order, or otherwise evaluate the relative significance or risk presented by each of the identified threats. For example, the process may determine that a truck approaching from the right is a bigger risk than a bicycle approaching from behind. On the other hand, if the truck is moving very slowly (thus leaving more time for the truck and/or the user to avoid it) compared to the bicycle, the process may instead determine that the bicycle is the bigger risk.
At block 19.1803, the process performs causing the user to avoid the first one of the multiple threats. The process may so cause the user to avoid the more significant threat by warning the user of the more significant threat. In some embodiments, the process may instead or in addition display a ranking of the multiple threats. In some embodiments, the process may so cause the user by not informing the user of the less significant threat.
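As a non-limiting illustration of ranking multiple threats, the following sketch scores each threat by an assumed severity divided by its projected time to impact and selects the highest-scoring threat for the warning. The severity values, scoring formula, and class name are invented for the example; it merely mirrors the bicycle-versus-truck scenario discussed above.

    from dataclasses import dataclass

    @dataclass
    class Threat:
        label: str
        severity: float          # rough harm estimate, e.g. 1 = bicycle, 10 = truck
        time_to_impact: float    # seconds until projected collision

    def risk(threat):
        # Higher severity and less remaining time both raise the score; the
        # exact form is an illustrative assumption.
        return threat.severity / max(threat.time_to_impact, 0.1)

    def most_significant(threats):
        return max(threats, key=risk)

    threats = [
        Threat("bicycle approaching from the side", severity=1.0, time_to_impact=2.0),
        Threat("truck approaching from the rear", severity=10.0, time_to_impact=6.0),
    ]
    print(most_significant(threats).label)   # truck approaching from the rear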
At block 19.1901, the process performs determining vehicular threat information related to factors other than ones related to the first vehicle. The process may consider a variety of other factors or information in addition to those related to the first vehicle, such as road conditions, the presence or absence of other vehicles, or the like.
At block 19.2001, the process performs determining that poor driving conditions exist. Poor driving conditions may include or be based on weather information (e.g., snow, rain, ice, temperature), time information (e.g., night or day), lighting information (e.g., a light sensor indicating that the user is traveling towards the setting sun), or the like.
At block 19.2101, the process performs determining that a limited visibility condition exists. Limited visibility may be due to the time of day (e.g., at dusk, dawn, or night), weather (e.g., fog, rain), or the like.
At block 19.2201, the process performs determining that there is stalled or slow traffic in proximity to the user. The process may receive and integrate information from traffic information systems (e.g., that report accidents), other vehicles (e.g., that are reporting their speeds), or the like.
At block 19.2301, the process performs determining that poor surface conditions exist on a roadway traveled by the user. Poor surface conditions may be due to weather (e.g., ice, snow, rain), temperature, surface type (e.g., gravel road), foreign materials (e.g., oil), or the like.
At block 19.2401, the process performs determining that there is a pedestrian in proximity to the user. The presence of pedestrians may be determined in various ways. In some embodiments pedestrians may wear devices that transmit their location and/or presence. In other embodiments, pedestrians may be detected based on their heat signature, such as by an infrared sensor on the wearable device, user vehicle, or the like.
At block 19.2501, the process performs determining that there is an accident in proximity to the user. Accidents may be identified based on traffic information systems that report accidents, vehicle-based systems that transmit when collisions have occurred, or the like.
At block 19.2601, the process performs determining that there is an animal in proximity to the user. The presence of an animal may be determined as discussed with respect to pedestrians, above.
At block 19.2701, the process performs determining the vehicular threat information based on kinematic information. The process may consider a variety of kinematic information received from various sources, such as the wearable device, a vehicle of the user, the first vehicle, or the like. The kinematic information may include information about the position, velocity, acceleration, or the like of the user and/or the first vehicle.
At block 19.2801, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the user obtained from sensors in the wearable device. The wearable device may include position sensors (e.g., GPS), accelerometers, or other devices configured to provide kinematic information about the user to the process.
At block 19.2901, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the user obtained from devices in a vehicle of the user. A vehicle occupied or operated by the user may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide kinematic information about the user to the process.
At block 19.3001, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the first vehicle. The first vehicle may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide kinematic information about the user to the process. In other embodiments, kinematic information may be obtained from other sources, such as a radar gun deployed at the side of a road, from other vehicles, or the like.
At block 19.3101, the process performs presenting the vehicular threat information via an audio output device of the wearable device. The process may play an alarm, bell, chime, voice message, or the like that warns or otherwise informs the user of the vehicular threat information. The wearable device may include audio speakers operable to output audio signals, including as part of a set of earphones, earbuds, a headset, a helmet, or the like.
At block 19.3201, the process performs presenting the vehicular threat information via a visual display device of the wearable device. In some embodiments, the wearable device includes a display screen or other mechanism for presenting visual information. For example, when the wearable device is a helmet, a face shield of the helmet may be used as a type of heads-up display for presenting the vehicular threat information.
At block 19.3301, the process performs displaying an indicator that instructs the user to look towards the first vehicle. The displayed indicator may be textual (e.g., “Look right!”), iconic (e.g., an arrow), or the like.
At block 19.3401, the process performs displaying an indicator that instructs the user to accelerate, decelerate, and/or turn. An example indicator may be or include the text “Speed up,” “slow down,” “turn left,” or similar language.
At block 19.3501, the process performs directing the user to accelerate.
At block 19.3601, the process performs directing the user to decelerate.
At block 19.3701, the process performs directing the user to turn.
At block 19.3801, the process performs transmitting to the first vehicle a warning based on the vehicular threat information. The process may send or otherwise transmit a warning or other message to the first vehicle that instructs the operator of the first vehicle to take evasive action. The instruction to the first vehicle may be complementary to any instructions given to the user, such that if both instructions are followed, the risk of collision decreases. In this manner, the process may help avoid a situation in which the user and the operator of the first vehicle take actions that actually increase the risk of collision, such as may occur when the user and the first vehicle are approaching head-on but do not turn away from one another.
At block 19.3901, the process performs presenting the vehicular threat information via an output device of a vehicle of the user, the output device including a visual display and/or an audio speaker. In some embodiments, the process may use other devices to output the vehicular threat information, such as output devices of a vehicle of the user, including a car stereo, dashboard display, or the like.
At block 19.4301, the process performs presenting the vehicular threat information via goggles worn by the user. The goggles may include a small display, an audio speaker, a haptic output device, or the like.
At block 19.4401, the process performs presenting the vehicular threat information via a helmet worn by the user. The helmet may include an audio speaker or visual output device, such as a display that presents information on the inside of the face screen of the helmet. Other output devices, including haptic devices, are contemplated.
At block 19.4501, the process performs presenting the vehicular threat information via a hat worn by the user. The hat may include an audio speaker or similar output device.
At block 19.4601, the process performs presenting the vehicular threat information via eyeglasses worn by the user. The eyeglasses may include a small display, an audio speaker, a haptic output device, or the like.
At block 19.4701, the process performs presenting the vehicular threat information via audio speakers that are part of at least one of earphones, a headset, earbuds, and/or a hearing aid. The audio speakers may be integrated into the wearable device. In other embodiments, other audio speakers (e.g., of a car stereo) may be employed instead or in addition.
At block 19.4801, the process performs performing the receiving data representing an audio signal, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing device in the wearable device of the user. In some embodiments, a computing device of or in the wearable device may be responsible for performing one or more of the operations of the process. For example, a computing device situated within a helmet worn by the user may receive and analyze audio data to determine and present the vehicular threat information to the user.
At block 19.4901, the process performs performing the receiving data representing an audio signal, the determining vehicular threat information, and/or the presenting the vehicular threat information on a road-side computing system. In some embodiments, an in-situ computing system may be responsible for performing one or more of the operations of the process. For example, a computing system situated at or about a street intersection may receive and analyze audio signals of vehicles that are entering or nearing the intersection. Such an architecture may be beneficial when the wearable device is a “thin” device that does not have sufficient processing power to, for example, determine whether the first vehicle is approaching the user.
At block 19.4902, the process performs transmitting the vehicular threat information from the road-side computing system to the wearable device of the user. For example, when the road-side computing system determines that two vehicles may be on a collision course, the computing system can transmit vehicular threat information to the wearable device so that the user can take evasive action and avoid a possible accident.
At block 19.5001, the process performs performing the receiving data representing an audio signal, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing system in the first vehicle. In some embodiments, a computing system in the first vehicle performs one or more of the operations of the process. Such an architecture may be beneficial when the wearable device is a “thin” device that does not have sufficient processing power to, for example, determine whether the first vehicle is approaching the user.
At block 19.5002, the process performs transmitting the vehicular threat information from the computing system to the wearable device of the user.
At block 19.5101, the process performs performing the receiving data representing an audio signal, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing system in a second vehicle, wherein the user is not traveling in the second vehicle. In some embodiments, other vehicles that are not carrying the user and are not the same as the first vehicle may perform one or more of the operations of the process. In general, computing systems/devices situated in or at multiple vehicles, wearable devices, or fixed stations in a roadway may each perform operations related to determining vehicular threat information, which may then be shared with other users and devices to improve traffic flow, avoid collisions, and generally enhance the abilities of users of the roadway.
At block 19.5102, the process performs transmitting the vehicular threat information from the computing system to the wearable device of the user.
At block 19.5201, the process performs receiving data representing a visual signal that represents the first vehicle. In some embodiments, the process may also consider video data, such as by performing image processing to identify vehicles or other hazards, to determine whether collisions may occur, and the like. The video data may be obtained from various sources, including the wearable device, a vehicle, a road-side camera, or the like.
At block 19.5202, the process performs determining the vehicular threat information based further on the data representing the visual signal. For example, the process may determine that a car is approaching by analyzing an image taken from a camera that is part of the wearable device.
At block 19.5301, the process performs receiving an image of the first vehicle obtained by a camera of a vehicle operated by the user. The user's vehicle may include one or more cameras that may capture views to the front, sides, and/or rear of the vehicle, and provide these images to the process for image processing or other analysis.
At block 19.5401, the process performs receiving an image of the first vehicle obtained by a camera of the wearable device. For example, where the wearable device is a helmet, the helmet may include one or more helmet cameras that may capture views to the front, sides, and/or rear of the helmet.
At block 19.5501, the process performs identifying the first vehicle in an image represented by the data representing a visual signal. Image processing techniques may be employed to identify the presence of a vehicle, its type (e.g., car or truck), its size, or other information.
At block 19.5601, the process performs determining whether the first vehicle is moving towards the user based on multiple images represented by the data representing the visual signal. In some embodiments, a video feed or other sequence of images may be analyzed to determine the relative motion of the first vehicle. For example, if the first vehicle appears to be becoming larger over a sequence of images, then it is likely that the first vehicle is moving towards the user.
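One simple cue for such an analysis is the apparent growth of the detected vehicle across frames, as in the sketch below, which operates on bounding-box heights assumed to come from some upstream object detector; the growth threshold and example values are illustrative only.

    def appears_to_approach(bbox_heights, min_growth=1.15):
        """Given the pixel heights of the same detected vehicle across a sequence
        of frames, a steady increase in apparent size suggests the vehicle is
        moving toward the camera."""
        return all(b > a for a, b in zip(bbox_heights, bbox_heights[1:])) and \
            bbox_heights[-1] >= bbox_heights[0] * min_growth

    # Heights (in pixels) of the same car detected over five consecutive frames.
    print(appears_to_approach([40, 44, 49, 55, 62]))   # True
    print(appears_to_approach([40, 41, 40, 39, 40]))   # False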
At block 19.5701, the process performs receiving data representing the first vehicle obtained at a road-based device. In some embodiments, the process may also consider data received from devices that are located in or about the roadway traveled by the user. Such devices may include cameras, loop coils, motion sensors, and the like.
At block 19.5702, the process performs determining the vehicular threat information based further on the data representing the first vehicle. For example, the process may determine that a car is approaching the user by analyzing an image taken from a camera that is mounted on or near a traffic signal over an intersection.
At block 19.5801, the process performs receiving the data from a sensor deployed at an intersection. Various types of sensors are contemplated, including cameras, range sensors (e.g., sonar, LIDAR, IR-based), magnetic coils, audio sensors, or the like.
At block 19.5901, the process performs receiving an image of the first vehicle from a camera deployed at an intersection. For example, the process may receive images from a camera that is fixed to a traffic light or other signal at an intersection.
At block 19.6001, the process performs receiving ranging data from a range sensor deployed at an intersection, the ranging data representing a distance between the first vehicle and the intersection. For example, the process may receive a distance (e.g., 75 meters) measured between some known point in the intersection (e.g., the position of the range sensor) and an oncoming vehicle.
At block 19.6101, the process performs receiving data from an induction loop deployed in a road surface, the induction loop configured to detect the presence and/or velocity of the first vehicle. Induction loops may be embedded in the roadway and configured to detect the presence of vehicles passing over them. Some types of loops and/or processing may be employed to detect other information, including velocity, vehicle size, and the like.
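For instance, where two loops are deployed a known distance apart, a vehicle's speed may be estimated from the interval between the two loop activations, as in the following illustrative sketch (the loop spacing and timestamps are example values).

    def speed_from_loop_pair(t_first, t_second, loop_spacing_m):
        """Estimate vehicle speed from the timestamps (seconds) at which a vehicle
        triggers two induction loops a known distance apart."""
        dt = t_second - t_first
        if dt <= 0:
            raise ValueError("second loop must trigger after the first")
        return loop_spacing_m / dt   # meters per second

    # Loops 3 m apart triggered 0.2 s apart imply 15 m/s (about 54 km/h).
    print(round(speed_from_loop_pair(10.0, 10.2, 3.0), 2))  # 15.0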
At block 19.6201, the process performs identifying the first vehicle in an image obtained from the road-based sensor. Image processing techniques may be employed to identify the presence of a vehicle, its type (e.g., car or truck), its size, or other information.
At block 19.6301, the process performs determining a trajectory of the first vehicle based on multiple images obtained from the road-based device. In some embodiments, a video feed or other sequence of images may be analyzed to determine the position, speed, and/or direction of travel of the first vehicle.
At block 19.6401, the process performs receiving data representing vehicular threat information relevant to a second vehicle, the second vehicle not being used for travel by the user. As noted, vehicular threat information may in some embodiments be shared amongst vehicles and entities present in a roadway. For example, a vehicle that is traveling just ahead of the user may determine that it is threatened by the first vehicle. This information may be shared with the user so that the user can also take evasive action, such as by slowing down or changing course.
At block 19.6402, the process performs determining the vehicular threat information based on the data representing vehicular threat information relevant to the second vehicle. Having received vehicular threat information from the second vehicle, the process may determine that it is also relevant to the user, and then accordingly present it to the user.
At block 19.6501, the process performs receiving from the second vehicle an indication of stalled or slow traffic encountered by the second vehicle. Various types of threat information relevant to the second vehicle may be provided to the process, such as that there is stalled or slow traffic ahead of the second vehicle.
At block 19.6601, the process performs receiving from the second vehicle an indication of poor driving conditions experienced by the second vehicle. The second vehicle may share the fact that it is experiencing poor driving conditions, such as an icy or wet roadway.
At block 19.6701, the process performs receiving from the second vehicle an indication that the first vehicle is driving erratically. The second vehicle may share a determination that the first vehicle is driving erratically, such as by swerving, driving with excessive speed, driving too slow, or the like.
At block 19.6801, the process performs receiving from the second vehicle an image of the first vehicle. The second vehicle may include one or more cameras, and may share images obtained via those cameras with other entities.
At block 19.6901, the process performs transmitting the vehicular threat information to a second vehicle. As noted, vehicular threat information may in some embodiments be shared amongst vehicles and entities present in a roadway. In this example, the vehicular threat information is transmitted to a second vehicle (e.g., one following behind the user), so that the second vehicle may benefit from the determined vehicular threat information as well.
At block 19.7001, the process performs transmitting the vehicular threat information to an intermediary server system for distribution to other vehicles in proximity to the user. In some embodiments, intermediary systems may operate as relays for sharing the vehicular threat information with other vehicles and users of a roadway.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 17.100. In addition, the computing system 20.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 17.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 20.400 comprises a computer memory (“memory”) 20.401, a display 20.402, one or more Central Processing Units (“CPU”) 20.403, Input/Output devices 20.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 20.405, and network connections 20.406. The AEFS 17.100 is shown residing in memory 20.401. In other embodiments, some portion of the contents, some or all of the components of the AEFS 17.100 may be stored on and/or transmitted over the other computer-readable media 20.405. The components of the AEFS 17.100 preferably execute on one or more CPUs 20.403 and implement techniques described herein. Other code or programs 20.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 20.420, also reside in the memory 20.401, and preferably execute on one or more CPUs 20.403. Of note, one or more of the components in
The AEFS 17.100 interacts via the network 20.450 with wearable devices 17.120, information sources 17.130, and third-party systems/applications 20.455. The network 20.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 20.455 may include any systems that provide data to, or utilize data from, the AEFS 17.100, including Web browsers, vehicle-based client systems, traffic tracking, monitoring, or prediction systems, and the like.
The AEFS 17.100 is shown executing in the memory 20.401 of the computing system 20.400. Also included in the memory are a user interface manager 20.415 and an application program interface (“API”) 20.416. The user interface manager 20.415 and the API 20.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 17.100.
The UI manager 20.415 provides a view and a controller that facilitate user interaction with the AEFS 17.100 and its various components. For example, the UI manager 20.415 may provide interactive access to the AEFS 17.100, such that users can configure the operation of the AEFS 17.100, such as by providing the AEFS 17.100 with information about common routes traveled, vehicle types used, driving patterns, or the like. The UI manager 20.415 may also manage and/or implement various output abstractions, such that the AEFS 17.100 can cause vehicular threat information to be displayed on different media, devices, or systems. In some embodiments, access to the functionality of the UI manager 20.415 may be provided via a Web server, possibly executing as one of the other programs 20.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 20.455 can interact with the AEFS 17.100 via the UI manager 20.415.
The API 20.416 provides programmatic access to one or more functions of the AEFS 17.100. For example, the API 20.416 may provide a programmatic interface to one or more functions of the AEFS 17.100 that may be invoked by one of the other programs 20.430 or some other module. In this manner, the API 20.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 17.100 into vehicle-based client systems or devices), and the like.
In addition, the API 20.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the wearable devices 17.120, information sources 17.130, and/or one of the third-party systems/applications 20.455, to access various functions of the AEFS 17.100. For example, an information source 17.130 such as a radar gun installed at an intersection may push kinematic information (e.g., velocity) about vehicles to the AEFS 17.100 via the API 20.416. As another example, a weather information system may push current conditions information (e.g., temperature, precipitation) to the AEFS 17.100 via the API 20.416. The API 20.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 20.455 and that are configured to interact with the AEFS 17.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 17.100 are implemented using standard programming techniques. For example, the AEFS 17.100 may be implemented as a “native” executable running on the CPU 20.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 17.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 20.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 17.100, such as in the data store 20.420 (or 18.240), can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through scripting languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 20.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner including but not limited to TCP/IP sockets, RPC, RMI, HTTP, Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 17.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for enhanced voice conferencing and, more particularly, for recording and presenting voice conference history information based on speaker-related information determined from speaker utterances and/or other sources. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). The AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities of a user, such as by recording and presenting voice conference history based on speaker-related information related to participants in a voice conference (e.g., conference call, face-to-face meeting). For example, when multiple speakers engage in a voice conference (e.g., a telephone conference), the AEFS may “listen” to the voice conference in order to determine speaker-related information, such as identifying information (e.g., name, title) about the current speaker (or some other speaker) and/or events/communications relating to the current speaker and/or to the subject matter of the conference call generally. Then, the AEFS may record voice conference history information based on the determined speaker-related information. The recorded conference history information may include transcriptions of utterances made by users, indications of topics discussed during the voice conference, information items (e.g., email messages, calendar events, documents) related to the voice conference, or the like. Next, the AEFS may inform a user (typically one of the participants in the voice conference) of the recorded conference history information, such as by presenting the information via a conferencing device (e.g., smart phone, laptop, desktop telephone) associated with the user. The user can then receive the information (e.g., by reading or hearing it via the conferencing device) provided by the AEFS and advantageously use that information to avoid embarrassment (e.g., due to having joined the voice conference late and thus having missed some of its contents), engage in a more productive conversation (e.g., by quickly accessing information about events, deadlines, or communications discussed during the voice conference), or the like.
In some embodiments, the AEFS is configured to receive data that represents speech signals from a voice conference amongst multiple speakers. The multiple speakers may be remotely located from one another, such as by being in different rooms within a building, by being in different buildings within a site or campus, by being in different cities, or the like. Typically, the multiple speakers are each using a conferencing device, such as a land-line telephone, cell phone, smart phone, computer, or the like, to communicate with one another. In some cases, such as when the multiple speakers are together in one room, the speakers may not be using a conferencing device to communicate with one another, but at least one of the speakers may have a conferencing device (e.g., a smart phone or personal media player/device) that records conference history information as described herein.
The AEFS may obtain the data that represents the speech signals from one or more of the conferencing devices and/or from some intermediary point, such as a conference call facility, chat system, videoconferencing system, PBX, or the like. The AEFS may then determine voice conference-related information, including speaker-related information associated with the one or more of the speakers. Determining speaker-related information may include identifying the speaker based at least in part on the received data, such as by performing speaker recognition and/or speech recognition with the received data. Determining speaker-related information may also or instead include determining an identifier (e.g., name or title) of the speaker, content of the speaker's utterance, an information item (e.g., a document, event, communication) that references the speaker, or the like. Next, the AEFS records conference history information based on the determined speaker-related information. In some embodiments, recording conference history information may include generating a timeline, log, history, or other structure that associates speaker-related information with a timestamp or other time indicator. Then, the AEFS may inform a user of the conference history information by, for example, visually presenting the conference history information via a display screen of a conferencing device associated with the user. In other embodiments, some other display may be used, such as a screen on a laptop computer that is being used by the user while the user is engaged in the voice conference via a telephone. In some embodiments, the AEFS may inform the user in an audible manner, such as by “speaking” the conference-history information via an audio speaker of the conferencing device.
In some embodiments, the AEFS may perform other services, including translating utterances made by speakers in a voice conference, so that a multi-lingual voice conference may be facilitated even when some speakers do not understand the language used by other speakers. In such cases, the determined speaker-related information may be used to enhance or augment language translation and/or related processes, including speech recognition, natural language processing, and the like. In addition, the conference history information may be recorded in one or more languages, so that it can be presented in a native language of each of one or more users.
The AEFS 21.100 and the conferencing devices 21.120 are communicatively coupled to one another via the communication system 21.150. The AEFS 21.100 is also communicatively coupled to speaker-related information sources 21.130, including messages 21.130a, documents 21.130b, and audio data 21.130c. The AEFS 21.100 uses the information in the information sources 21.130, in conjunction with data received from the conferencing devices 21.120, to determine information related to the voice conference, including speaker-related information associated with the speakers 21.102.
In the scenario illustrated in
The AEFS 21.100 receives data representing a speech signal that represents the utterance 21.110, such as by receiving a digital representation of an audio signal transmitted by conferencing device 21.120b. The data representing the speech signal may include audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal. The AEFS 21.100 may receive the data in various ways, including from one or more of the conferencing devices or from some intermediate system (e.g., a voice conferencing system that is facilitating the conference between the conferencing devices 21.120).
The AEFS 21.100 then determines speaker-related information associated with the speaker 21.102b. Determining speaker-related information may include identifying the speaker 21.102b based on the received data representing the speech signal. In some embodiments, identifying the speaker may include performing speaker recognition, such as by generating a “voice print” from the received data and comparing the generated voice print to previously obtained voice prints. For example, the generated voice print may be compared to multiple voice prints that are stored as audio data 21.130c and that each correspond to a speaker, in order to determine a speaker who has a voice that most closely matches the voice of the speaker 21.102b. The voice prints stored as audio data 21.130c may be generated based on various sources of data, including data corresponding to speakers previously identified by the AEFS 21.100, voice mail messages, speaker enrollment data, or the like.
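The following Python sketch illustrates the general voice-print idea with a deliberately simplified feature (a banded average log-magnitude spectrum) and cosine similarity. Production speaker-recognition systems would use richer features such as MFCCs or learned embeddings; the synthetic "speakers" below are stand-ins created only so that the example runs.

    import numpy as np

    def voice_print(samples, bands=32):
        """A toy voice print: the log-magnitude spectrum folded into fixed bands."""
        spectrum = np.abs(np.fft.rfft(samples))
        return np.log1p(np.array([b.mean() for b in np.array_split(spectrum, bands)]))

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def identify(samples, enrolled):
        """Return the enrolled name whose stored print best matches the sample."""
        probe = voice_print(samples)
        return max(enrolled, key=lambda name: cosine_similarity(probe, enrolled[name]))

    # Synthetic enrollment: two different spectral shapes stand in for two voices.
    rate = 8000
    t = np.arange(rate) / rate
    alice = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
    bob = np.sin(2 * np.pi * 600 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)
    enrolled = {"Alice": voice_print(alice), "Bob": voice_print(bob)}
    noisy_bob = bob + 0.05 * np.random.default_rng(0).standard_normal(rate)
    print(identify(noisy_bob, enrolled))  # Bob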
In some embodiments, identifying the speaker 21.102b may include performing speech recognition, such as by automatically converting the received data representing the speech signal into text. The text of the speaker's utterance may then be used to identify the speaker 21.102b. In particular, the text may identify one or more entities such as information items (e.g., communications, documents), events (e.g., meetings, deadlines), persons, or the like, that may be used by the AEFS 21.100 to identify the speaker 21.102b. The information items may be accessed with reference to the messages 21.130a and/or documents 21.130b. As one example, the speaker's utterance 21.110 may identify an email message that was sent to the speaker 21.102b and possibly others (e.g., “That sure was a nasty email Bob sent”). As another example, the speaker's utterance 21.110 may identify a meeting or other event to which the speaker 21.102b and possibly others are invited.
Note that in some cases, the text of the speaker's utterance 21.110 may not definitively identify the speaker 21.102b, such as because the speaker 21.102b has not previously met or communicated with other participants in the voice conference or because a communication was sent to recipients in addition to the speaker 21.102b. In such cases, there may be some ambiguity as to the identity of the speaker 21.102b. However, in such cases, a preliminary identification of multiple candidate speakers may still be used by the AEFS 21.100 to narrow the set of potential speakers, and may be combined with (or used to improve) other techniques, including speaker recognition, speech recognition, language translation, or the like. In addition, even if the speaker 21.102 is unknown to the user 21.102a, the AEFS 21.100 may still determine useful demographic or other speaker-related information that may be fruitfully employed for speech recognition or other purposes.
Note also that speaker-related information need not definitively identify the speaker. In particular, it may also or instead be or include other information about or related to the speaker, such as demographic information including the gender of the speaker 21.102, his country or region of origin, the language(s) spoken by the speaker 21.102, or the like. Speaker-related information may include an organization that includes the speaker (along with possibly other persons, such as a company or firm), an information item that references the speaker (and possibly other persons), an event involving the speaker, or the like. The speaker-related information may generally be determined with reference to the messages 21.130a, documents 21.130b, and/or audio data 21.130c. For example, having determined the identity of the speaker 21.102, the AEFS 21.100 may search for emails and/or documents that are stored as messages 21.130a and/or documents 21.130b and that reference (e.g., are sent to, are authored by, are named in) the speaker 21.102.
Other types of speaker-related information are contemplated, including social networking information, such as personal or professional relationship graphs represented by a social networking service, messages or status updates sent within a social network, or the like. Social networking information may also be derived from other sources, including email lists, contact lists, communication patterns (e.g., frequent recipients of emails), or the like.
The AEFS 21.100 then determines and/or records (e.g., stores, saves) conference history information based on the determined speaker-related information. For example, the AEFS 21.100 may associate a timestamp with speaker-related information, such as a transcription of an utterance (e.g., generated by a speech recognition process), an indication of an information item referenced by a speaker (e.g., a message, a document, a calendar event), topics discussed during the voice conference, or the like. The conference history information may be recorded locally to the AEFS 21.100, on conferencing devices 21.120, or at other locations, such as cloud-based storage systems.
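A minimal sketch of such a record, assuming a simple in-memory timeline keyed by timestamps, is shown below; the class and field names are illustrative assumptions rather than a required data layout.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class HistoryEntry:
        timestamp: float
        speaker: str
        kind: str          # e.g. "utterance", "topic", "information_item"
        content: str

    @dataclass
    class ConferenceHistory:
        entries: list = field(default_factory=list)

        def record(self, speaker, kind, content):
            self.entries.append(HistoryEntry(time.time(), speaker, kind, content))

        def search(self, keyword):
            """Return entries mentioning a keyword, e.g. to check whether a
            deadline was discussed earlier in the call."""
            return [e for e in self.entries if keyword.lower() in e.content.lower()]

    history = ConferenceHistory()
    history.record("Bill", "utterance", "Let's set the project deadline to next week.")
    print([e.speaker for e in history.search("deadline")])   # ['Bill']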
The AEFS 21.100 then informs the user (speaker 21.102a) of at least some of the conference history information. Informing the user may include audibly presenting the information to the user via an audio speaker of the conferencing device 21.120a. In this example, the conferencing device 21.120a tells the user 21.102a, such as by playing audio via an earpiece or in another manner that cannot be detected by the other participants in the voice conference, to check the conference history presented by conferencing device 21.120a. In particular, the conferencing device 21.120a plays audio that includes the utterance 21.113 “Check history” to the user. The AEFS 21.100 may cause the conferencing device 21.120a to play such a notification because, for example, it has automatically searched the conference history and determined that the topic of the deadline has been previously discussed during the voice conference.
Informing the user of the conference history information may also or instead include visually presenting the information, such as via the display 21.121 of the conferencing device 21.120a. In the illustrated example, the AEFS 21.100 causes a message 21.112 that includes a portion of a transcript of the voice conference to be displayed on the display 21.121. In this example, the displayed transcript includes a statement from Bill (speaker 21.102b) that sets the project deadline to next week, not tomorrow. Upon reading the message 21.112 and thereby learning of the previously established project deadline, the speaker 21.102a responds to the original utterance 21.110 of speaker 21.102b (Bill) with a response utterance 21.114 that includes the words “But earlier Bill said next week,” referring to the earlier statement of speaker 21.102b that is counter to the deadline expressed by his current utterance 21.110. In the illustrated example, speaker 21.102c, upon hearing the utterance 21.114, responds with an utterance 21.115 that includes the words “I agree with Joe,” indicating his agreement with speaker 21.102a.
As the speakers 21.102a-102c continue to engage in the voice conference, the AEFS 21.100 may monitor the conversation and continue to record and present conference history information based on speaker-related information at least for the speaker 21.102a. Another example function that may be performed by the AEFS 21.100 includes concurrently presenting speaker-related information as it is determined, such as by presenting, as each of the multiple speakers takes a turn speaking during the voice conference, information about the identity of the current speaker. For example, in response to the onset of an utterance of a speaker, the AEFS 21.100 may display the name of the speaker on the display 21.121, so that the user is always informed as to who is speaking.
The AEFS 21.100 may perform other services, including translating utterances made by speakers in the voice conference, so that a multi-lingual voice conference may be conducted even between participants who do not understand all of the languages being spoken. Translating utterances may initially include determining speaker-related information by automatically determining the language that is being used by a current speaker. Determining the language may be based on signal processing techniques that identify signal characteristics unique to particular languages. Determining the language may also or instead be performed by simultaneous or concurrent application of multiple speech recognizers that are each configured to recognize speech in a corresponding language, and then choosing the language corresponding to the recognizer that produces the result having the highest confidence level. Determining the language may also or instead be based on contextual factors, such as GPS information indicating that the current speaker is in Germany, Austria, or some other region where German is commonly spoken.
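The multi-recognizer approach to language identification may be illustrated with a minimal sketch. The `identify_language` function and the stub recognizers below are hypothetical illustrations rather than part of any particular speech recognition library; in practice each recognizer would wrap an engine configured for a single language and report its own confidence score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class RecognitionResult:
    text: str          # best transcription hypothesis in the recognizer's language
    confidence: float  # engine-reported confidence, assumed to lie in [0, 1]


def identify_language(
    audio: bytes,
    recognizers: Dict[str, Callable[[bytes], RecognitionResult]],
) -> Tuple[str, RecognitionResult]:
    """Run each language-specific recognizer over the same audio and return the
    language whose recognizer reports the highest-confidence result."""
    best_lang, best = None, None
    for lang, recognize in recognizers.items():
        result = recognize(audio)
        if best is None or result.confidence > best.confidence:
            best_lang, best = lang, result
    return best_lang, best


# Usage with stub recognizers standing in for real engines:
stubs = {
    "en": lambda audio: RecognitionResult("set the deadline to next week", 0.91),
    "de": lambda audio: RecognitionResult("sets die deadline", 0.42),
}
language, result = identify_language(b"<pcm samples>", stubs)
print(language, result.text)  # -> en set the deadline to next week
```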
Having determined speaker-related information, the AEFS 21.100 may then translate an utterance in a first language into an utterance in a second language. In some embodiments, the AEFS 21.100 translates an utterance by first performing speech recognition to translate the utterance into a textual representation that includes a sequence of words in the first language. Then, the AEFS 21.100 may translate the text in the first language into a message in a second language, using machine translation techniques. Speech recognition and/or machine translation may be modified, enhanced, and/or otherwise adapted based on the speaker-related information. For example, a speech recognizer may use speech or language models tailored to the speaker's gender, accent/dialect (e.g., determined based on country/region of origin), social class, or the like. As another example, a lexicon that is specific to the speaker may be used during speech recognition and/or language translation. Such a lexicon may be determined based on prior communications of the speaker, profession of the speaker (e.g., engineer, attorney, doctor), or the like.
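As a rough illustration of lexicon adaptation, the following sketch substitutes speaker-specific word senses into an otherwise generic word-for-word translation. The dictionaries and the `translate` helper are toy stand-ins assumed only for illustration; a production system would use a full machine translation engine and merely bias or extend its vocabulary with the speaker-specific lexicon.

```python
# Generic German-to-English dictionary (illustrative, not a real MT system).
GENERIC_DE_EN = {"frist": "period", "entwurf": "draft", "bis": "by",
                 "nächste": "next", "woche": "week"}


def translate(words, generic, speaker_lexicon):
    """Translate word by word, preferring speaker-specific senses when available."""
    out = []
    for word in words:
        key = word.lower()
        # A lexicon tailored to the speaker (e.g., an attorney) can pick a
        # domain-appropriate sense that the generic dictionary would miss.
        out.append(speaker_lexicon.get(key, generic.get(key, word)))
    return " ".join(out)


attorney_lexicon = {"frist": "deadline", "entwurf": "brief"}
print(translate(["Entwurf", "bis", "nächste", "Woche"], GENERIC_DE_EN, attorney_lexicon))
# -> "brief by next week"
```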
Once the AEFS 21.100 has translated an utterance in a first language into a message in a second language, the AEFS 21.100 can present the message in the second language. Various techniques are contemplated. In one approach, the AEFS 21.100 causes the conferencing device 21.120a (or some other device accessible to the user) to visually display the message on the display 21.121. In another approach, the AEFS 21.100 causes the conferencing device 21.120a (or some other device) to “speak” or “tell” the user/speaker 21.102a the message in the second language. Presenting a message in this manner may include converting a textual representation of the message into audio via text-to-speech processing (e.g., speech synthesis), and then presenting the audio via an audio speaker (e.g., earphone, earpiece, earbud) of the conferencing device 21.120a.
At least some of the techniques described above with respect to translation may be applied in the context of generating and recording conference history information. For example, speech recognition and natural language processing may be employed by the AEFS 21.100 to transcribe user utterances, determine topics of conversation, identify information items referenced by speakers, and the like.
As an initial matter, note that the AEFS 21.100 may use output devices of a conferencing device or other devices to present information to a user, such as speaker-related information and/or conference history information that may generally assist the user in engaging in a voice conference with other participants. For example, the AEFS 21.100 may present speaker-related information about a current or previous speaker, such as his name, title, communications that reference or are related to the speaker, and the like.
For audio output, each of the illustrated conferencing devices 21.120 may include or be communicatively coupled to an audio speaker operable to generate and output audio signals that may be perceived by the user 21.102. As discussed above, the AEFS 21.100 may use such a speaker to provide speaker-related information and/or conference history information to the user 21.102. The AEFS 21.100 may also or instead audibly notify, via a speaker of a conferencing device 21.120, the user 21.102 to view information displayed on the conferencing device 21.120. For example, the AEFS 21.100 may cause a tone (e.g., beep, chime) to be played via the earpiece of the telephone 21.120f. Such a tone may then be recognized by the user 21.102, who will in response attend to information displayed on the display 21.121c. Such audible notification may be used to identify a display that is being used as a current display, such as when multiple displays are being used. For example, different first and second tones may be used to direct the user's attention to the smart phone display 21.121a and laptop display 21.121b, respectively. In some embodiments, audible notification may include playing synthesized speech (e.g., from text-to-speech processing) telling the user 21.102 to view speaker-related information and/or conference history information on a particular display device (e.g., “See email on your smart phone”).
The AEFS 21.100 may generally cause information (e.g., speaker-related information, conference history information, translations) to be presented on various destination output devices. In some embodiments, the AEFS 21.100 may use a display of a conferencing device as a target for displaying information. For example, the AEFS 21.100 may display information on the display 21.121a of the smart phone 21.120d. On the other hand, when the conferencing device does not have its own display or if the display is not suitable for displaying the determined information, the AEFS 21.100 may display information on some other destination display that is accessible to the user 21.102. For example, when the telephone 21.120f is the conferencing device and the user also has the laptop computer 21.120e in his possession, the AEFS 21.100 may elect to display an email or other substantial document upon the display 21.121b of the laptop computer 21.120e. Thus, as a general matter, a conferencing device may be any device with which a person may participate in a voice conference, by speaking, listening, seeing, or another interaction modality.
The AEFS 21.100 may determine a destination output device for conference history information, speaker-related information, translations, or other information. In some embodiments, determining a destination output device may include selecting from one of multiple possible destination displays based on whether a display is capable of displaying all of the information. For example, if the environment is noisy, the AEFS may elect to visually display a transcription or a translation rather than play it through a speaker. As another example, if the user 21.102 is proximate to a first display that is capable of displaying only text and a second display capable of displaying graphics, the AEFS 21.100 may select the second display when the presented information includes graphics content (e.g., an image). In some embodiments, determining a destination display may include selecting from one of multiple possible destination displays based on the size of each display. For example, a small LCD display (such as may be found on a mobile phone or telephone 21.120f) may be suitable for displaying a message that is just a few characters (e.g., a name or greeting) but not be suitable for displaying a longer message or a large document. Note that the AEFS 21.100 may select among multiple potential target output devices even when the conferencing device itself includes its own display and/or speaker.
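The destination-selection logic described above might be sketched as follows. The `Display` record, the capacity figures, and the prefer-the-smallest-suitable-display heuristic are illustrative assumptions, not requirements of the described embodiments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Display:
    name: str
    max_chars: int        # rough capacity of the display
    supports_graphics: bool


def select_destination(displays: List[Display], message_chars: int,
                       needs_graphics: bool, environment_noisy: bool) -> Optional[str]:
    """Pick an output device for a piece of conference history information.

    Filters out displays that cannot hold the message or render required
    graphics, and falls back to audio only when the environment is quiet."""
    candidates = [
        d for d in displays
        if d.max_chars >= message_chars and (d.supports_graphics or not needs_graphics)
    ]
    if not candidates:
        # No suitable display: use audio output unless noise makes that impractical.
        return None if environment_noisy else "audio"
    # Among suitable displays, prefer the smallest one that fits the content,
    # leaving larger displays free for richer material (an arbitrary heuristic).
    return min(candidates, key=lambda d: d.max_chars).name


displays = [
    Display("office-phone-lcd", max_chars=32, supports_graphics=False),
    Display("laptop", max_chars=5000, supports_graphics=True),
]
print(select_destination(displays, message_chars=1200,
                         needs_graphics=False, environment_noisy=True))
# -> "laptop"
```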
Determining a destination output device may be based on other or additional factors. In some embodiments, the AEFS 21.100 may use user preferences that have been inferred (e.g., based on current or prior interactions with the user 21.102) and/or explicitly provided by the user. For example, the AEFS 21.100 may determine to present a transcription, translation, an email, or other speaker-related information onto the display 21.121a of the smart phone 21.120d based on the fact that the user 21.102 is currently interacting with the smart phone 21.120d.
Note that although the AEFS 21.100 is shown as being separate from a conferencing device 21.120, some or all of the functions of the AEFS 21.100 may be performed within or by the conferencing device 21.120 itself. For example, the smart phone conferencing device 21.120d and/or the laptop computer conferencing device 21.120e may have sufficient processing power to perform all or some functions of the AEFS 21.100, including one or more of speaker identification, determining speaker-related information, speaker recognition, speech recognition, generating and recording conference history information, language translation, presenting information, or the like. In some embodiments, the conferencing device 21.120 includes logic to determine where to perform various processing tasks, so as to advantageously distribute processing between available resources, including that of the conferencing device 21.120, other nearby devices (e.g., a laptop or other computing device of the user 21.102), remote devices (e.g., “cloud-based” processing and/or storage), and the like.
Other types of conferencing devices and/or organizations are contemplated. In some embodiments, the conferencing device may be a “thin” device, in that it may serve primarily as an output device for the AEFS 21.100. For example, an analog telephone may still serve as a conferencing device, with the AEFS 21.100 presenting speaker or history information via the earpiece of the telephone. As another example, a conferencing device may be or be part of a desktop computer, PDA, tablet computer, or the like.
The illustrated user interface 21.140 includes a transcript 21.141, information items 21.142-144, and a timeline control 21.145. The timeline control 21.145 includes a slider 21.146 that can be manipulated by the user (e.g., by dragging to the left or the right) to specify a time during the voice conference. In this example, the user has positioned the slider at 0:25, indicating a moment in time that is 25 minutes from the beginning of the voice conference.
In response to a time selection via the timeline control 21.145, the AEFS dynamically updates the information presented via the user interface 21.140. In this example, the transcript 21.141 is updated to present transcriptions of utterances from about the 25 minute mark of the voice conference. Each of the transcribed utterances includes a timestamp, a speaker identifier, and text. For example, the first displayed utterance was made at 23 minutes into the voice conference by speaker Joe and reads “Can we discuss the next item on the agenda, the deadline?” At 24 minutes into the voice conference, speaker Bill indicates that the deadline should be next week, stating “Well, at the earliest, I think sometime next week would be appropriate.” At 25 minutes into the voice conference, speakers Joe and Bob agree by respectively uttering “That works for me” and “I'm checking my calendar . . . that works at my end.”
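One possible representation of the underlying conference history, assuming minute-granularity timestamps and simple in-memory storage, is sketched below. The class and method names are hypothetical; the `around` query supports the timeline control described above, and the `recent` query corresponds to the rolling display discussed below.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HistoryEntry:
    minute: int       # minutes from the start of the voice conference
    speaker: str      # resolved speaker identifier
    text: str         # transcription of the utterance
    items: list = field(default_factory=list)  # linked information items (e.g., "agenda")


class ConferenceHistory:
    """Timestamped record of a voice conference supporting timeline queries."""

    def __init__(self):
        self.entries: List[HistoryEntry] = []

    def record(self, entry: HistoryEntry) -> None:
        self.entries.append(entry)

    def around(self, minute: int, window: int = 2) -> List[HistoryEntry]:
        """Entries near a time selected on the timeline control."""
        return [e for e in self.entries if abs(e.minute - minute) <= window]

    def recent(self, now: int, window: int = 5) -> List[HistoryEntry]:
        """Rolling view of the last few minutes, for live display."""
        return [e for e in self.entries if now - window <= e.minute <= now]


history = ConferenceHistory()
history.record(HistoryEntry(23, "Joe", "Can we discuss the next item on the agenda, the deadline?", ["agenda"]))
history.record(HistoryEntry(24, "Bill", "Well, at the earliest, I think sometime next week would be appropriate."))
history.record(HistoryEntry(25, "Bob", "I'm checking my calendar ... that works at my end.", ["calendar"]))
for entry in history.around(25):
    print(f"[{entry.minute:02d}] {entry.speaker}: {entry.text}")
```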
The user interface 21.140 also presents information items that are related to the conference history information. In this example, the AEFS has identified and displayed three information items, including an agenda 21.142, a calendar 21.143, and an email 21.144. The user interface 21.140 may display the information items themselves (e.g., their content) and/or indications thereof (e.g., titles, icons, buttons) that may be used to access their contents. Each of the displayed information items was discussed or mentioned at or about the time specified via the timeline control 21.145. For example, at 23 and 26 minutes into the voice conference, speakers Joe and Bill each mentioned an “agenda.” In the illustrated embodiment, the AEFS determines that the term “agenda” referred to a document, an indication of which is displayed as agenda 21.142. Note also that the term “agenda” is highlighted in the transcript 21.141, such as via underlining. Note also that a link 21.147 is displayed that associates the term “agenda” in the transcript 21.141 with the agenda 21.142. As further examples, the terms “calendar” and “John's email” are respectively linked to the calendar 21.143 and the email 21.144.
Note that in some embodiments the time period within a conference history that is presented by the user interface 21.140 may be selected or updated automatically. For example, as a voice conference is in progress, the conference history will typically grow (as new items or transcriptions are added to the history). The user interface 21.140 may be configured, by default, to automatically display history information from a time window extending back a few minutes (e.g., one, two, five, ten) from the current time. In such situations, the user interface 21.140 may present a “rolling” display of the transcript 21.141 and associated information items.
As another example, when the AEFS identifies a topic of conversation, it may automatically update the user interface 21.140 to present conference history information relevant to that topic. For instance, in the example of
The speech and language engine 22.210 includes a speech recognizer 22.212, a speaker recognizer 22.214, a natural language processor 22.216, and a language translation processor 22.218. The speech recognizer 22.212 transforms speech audio data received (e.g., from the conferencing device 21.120) into textual representation of an utterance represented by the speech audio data. In some embodiments, the performance of the speech recognizer 22.212 may be improved or augmented by use of a language model (e.g., representing likelihoods of transitions between words, such as based on n-grams) or speech model (e.g., representing acoustic properties of a speaker's voice) that is tailored to or based on an identified speaker. For example, once a speaker has been identified, the speech recognizer 22.212 may use a language model that was previously generated based on a corpus of communications and other information items authored by the identified speaker. A speaker-specific language model may be generated based on a corpus of documents and/or messages authored by a speaker. Speaker-specific speech models may be used to account for accents or channel properties (e.g., due to environmental factors or communication equipment) that are specific to a particular speaker, and may be generated based on a corpus of recorded speech from the speaker. In some embodiments, multiple speech recognizers are present, each one configured to recognize speech in a different language.
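To make the benefit of a speaker-specific language model concrete, the following sketch rescores a recognizer's N-best hypotheses with an add-one smoothed bigram model. The counts, vocabulary size, and acoustic scores are invented for illustration, and the unweighted score combination is a simplification; real language models would be far larger and the combination would typically be weighted and tuned.

```python
import math

# Hypothetical speaker-specific counts, as might be derived from a corpus of
# the identified speaker's emails and documents.
BIGRAMS = {("next", "week"): 12, ("the", "deadline"): 9}
UNIGRAMS = {"next": 15, "the": 40, "deadline": 9, "week": 12}
VOCAB = 1000  # assumed vocabulary size for add-one smoothing


def lm_score(words):
    """Log probability of a word sequence under an add-one smoothed bigram model."""
    score = 0.0
    for prev, word in zip(words, words[1:]):
        num = BIGRAMS.get((prev, word), 0) + 1
        den = UNIGRAMS.get(prev, 0) + VOCAB
        score += math.log(num / den)
    return score


def rescore(nbest):
    """Combine each hypothesis's acoustic score with the speaker-specific
    language model score and return the best-scoring word sequence."""
    return max(nbest, key=lambda h: h[1] + lm_score(h[0]))[0]


nbest = [
    (["the", "dead", "line", "is", "necks", "weak"], -4.0),
    (["the", "deadline", "is", "next", "week"], -4.2),
]
print(" ".join(rescore(nbest)))  # the speaker-specific model prefers the second hypothesis
```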
The speaker recognizer 22.214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the conferencing device 21.120. The speaker recognizer 22.214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 22.240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data.
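A minimal matching sketch follows, assuming that voice prints are represented as fixed-length feature vectors and compared by cosine similarity; actual systems may instead use statistical voice models or learned embeddings, but the selection of a best match above a threshold proceeds similarly. The function names and threshold value are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two fixed-length voice feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matching_speaker(utterance_print, enrolled_prints, threshold=0.8):
    """Compare an utterance's voice print against enrolled prints and return the
    best match, or (None, threshold) if no enrolled speaker is similar enough."""
    best_name, best_score = None, threshold
    for name, enrolled in enrolled_prints.items():
        score = cosine_similarity(utterance_print, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


enrolled = {"Bill": [0.9, 0.1, 0.3], "Joe": [0.2, 0.8, 0.5]}
print(best_matching_speaker([0.85, 0.15, 0.35], enrolled))  # -> ('Bill', ...)
```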
The natural language processor 22.216 processes text generated by the speech recognizer 22.212 and/or located in information items obtained from the speaker-related information sources 21.130. In doing so, the natural language processor 22.216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification, language translation, and/or other functions of the AEFS 21.100. For example, the natural language processor 22.216 may process status updates posted by the user 21.102a on a social networking service, to determine that the user 21.102a recently attended a conference in a particular city, and this fact may be used to identify a speaker and/or determine other speaker-related information, which may in turn be used for language translation or other functions.
In some embodiments, the natural language processor 22.216 may determine topics or subjects discussed during the course of a conference call or other conversation. Information/text processing techniques or metrics may be used to identify key terms or concepts from text obtained from a user's utterances. For example, the natural language processor 22.216 may generate a term vector that associates text terms with frequency information including absolute counts, term frequency-inverse document frequency scores, or the like. The frequency information can then be used to identify important terms or concepts in the user's speech, such as by selecting those having a high score (e.g., above a certain threshold). Other text processing and/or machine learning techniques may be used to classify or otherwise determine concepts related to user utterances, including Bayesian classification, clustering, decision trees, and the like.
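The term-vector scoring can be sketched as below. The stop-word list, the smoothed IDF formula, and the treatment of individual utterances as documents are simplifying assumptions; a deployed system would likely compute document frequencies against a larger background corpus and apply stemming as well.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "on", "can", "we", "be", "i", "will", "with", "should"}


def tf_idf_topics(utterance_texts, top_n=3):
    """Score terms across transcribed utterances with a TF-IDF-style metric and
    return the highest-scoring terms as candidate topics of conversation."""
    docs = [
        [w for w in text.lower().split() if w not in STOP_WORDS]
        for text in utterance_texts
    ]
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))
    n_docs = len(docs)
    scores = Counter()
    for words in docs:
        counts = Counter(words)
        for term, count in counts.items():
            tf = count / len(words)
            # Smoothed IDF; in practice document frequencies would come from a
            # larger background corpus rather than the conference itself.
            idf = math.log(1 + n_docs / doc_freq[term])
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(top_n)]


print(tf_idf_topics([
    "can we discuss the next item on the agenda the deadline",
    "the deadline should be next week",
    "i will update the agenda with the new deadline",
]))
```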
The language translation processor 22.218 translates from one language to another, for example, by converting text in a first language to text in a second language. The text input to the language translation processor 22.218 may be obtained from, for example, the speech recognizer 22.212 and/or the natural language processor 22.216. The language translation processor 22.218 may use speaker-related information to improve or adapt its performance. For example, the language translation processor 22.218 may use a lexicon or vocabulary that is tailored to the speaker, such as may be based on the speaker's country/region of origin, the speaker's social class, the speaker's profession, or the like.
The agent logic 22.220 implements the core intelligence of the AEFS 21.100. The agent logic 22.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to identify speakers, determine speaker-related information, generate voice conference history information, and the like. For example, the agent logic 22.220 may combine spoken text from the speech recognizer 22.212, a set of potentially matching (candidate) speakers from the speaker recognizer 22.214, and information items from the information sources 21.130, in order to determine a most likely identity of the current speaker. As another example, the agent logic 22.220 may be configured to search or otherwise analyze conference history information to identify recurring topics, information items, or the like. As a further example, the agent logic 22.220 may identify the language spoken by the speaker by analyzing the output of multiple speech recognizers that are each configured to recognize speech in a different language, to identify the language of the speech recognizer that returns the highest confidence result as the spoken language.
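One simple way the agent logic might fuse such evidence is a weighted combination of voice-identification likelihoods and term overlap with documents tied to each candidate, as sketched below; the weighting scheme, data shapes, and example values are illustrative assumptions rather than a prescribed reasoning engine.

```python
def combine_evidence(voice_candidates, text_terms, candidate_documents, text_weight=0.3):
    """Fuse voice-identification likelihoods with text evidence to rank candidate speakers.

    voice_candidates: {name: likelihood in [0, 1]} from the speaker recognizer.
    text_terms: terms recognized in the current utterance.
    candidate_documents: {name: set of terms from documents/emails tied to that person}.
    """
    scores = {}
    terms = set(text_terms)
    for name, likelihood in voice_candidates.items():
        doc_terms = candidate_documents.get(name, set())
        overlap = len(terms & doc_terms) / len(terms) if terms else 0.0
        # Weighted combination; the weight is an arbitrary illustrative choice.
        scores[name] = (1 - text_weight) * likelihood + text_weight * overlap
    return max(scores, key=scores.get), scores


voice_candidates = {"Bill": 0.55, "Bob": 0.50}
docs = {"Bill": {"deadline", "project", "budget"}, "Bob": {"hiring", "travel"}}
best, scores = combine_evidence(voice_candidates, ["the", "project", "deadline"], docs)
print(best, scores)  # text evidence breaks the near-tie in favor of Bill
```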
The presentation engine 22.230 includes a visible output processor 22.232 and an audible output processor 22.234. The visible output processor 22.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the conferencing device 21.120 or some other display (e.g., a desktop or laptop display in proximity to the user 21.102a). The agent logic 22.220 may use or invoke the visible output processor 22.232 to prepare and display information, such as by formatting or otherwise modifying a transcription, translation, or some speaker-related information to fit on a particular type or size of display. The audible output processor 22.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 22.220 may use or invoke the audible output processor 22.234 in order to convert a textual message (e.g., including or referencing speaker-related information) into audio output suitable for presentation via the conferencing device 21.120, for example by employing a text-to-speech processor.
Note that although speaker identification and/or determining speaker-related information is herein sometimes described as including the positive identification of a single speaker, it may instead or also include determining likelihoods that each of one or more persons is the current speaker. For example, the speaker recognizer 22.214 may provide to the agent logic 22.220 indications of multiple candidate speakers, each having a corresponding likelihood or confidence level. The agent logic 22.220 may then select the most likely candidate based on the likelihoods alone or in combination with other information, such as that provided by the speech recognizer 22.212, natural language processor 22.216, speaker-related information sources 21.130, or the like. In some cases, such as when there are a small number of reasonably likely candidate speakers, the agent logic 22.220 may inform the user 21.102a of the identities of all of the candidate speakers (as opposed to a single candidate speaker), as such information may be sufficient to trigger the user's recall and enable the user to make a selection that informs the agent logic 22.220 of the speaker's identity.
Note that in some embodiments, one or more of the illustrated components, or components of different types, may be included or excluded. For example, in one embodiment, the AEFS 21.100 does not include the language translation processor 22.218.
FIGS. 23.1-23.94 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 23.101, the process performs receiving data representing speech signals from a voice conference amongst multiple speakers. The voice conference may be, for example, taking place between multiple speakers who are engaged in a conference call. The received data may be or represent one or more speech signals (e.g., audio samples) and/or higher-order information (e.g., frequency coefficients). In some embodiments, the process may receive data from a face-to-face conference amongst the speakers. The data may be received by or at the conferencing device 21.120 and/or the AEFS 21.100.
At block 23.102, the process performs determining speaker-related information associated with the multiple speakers, based on the data representing speech signals from the voice conference. The speaker-related information may include identifiers of a speaker (e.g., names, titles) and/or related information, such as documents, emails, calendar events, or the like. The speaker-related information may also or instead include demographic information about a speaker, including gender, language spoken, country of origin, region of origin, or the like. The speaker-related information may be determined based on signal properties of speech signals (e.g., a voice print) and/or on the semantic content of the speech signal, such as a name, event, entity, or information item that was mentioned by a speaker.
At block 23.103, the process performs recording conference history information based on the speaker-related information. In some embodiments, the process may record the voice conference and related information, so that such information can be played back at a later time, such as for reference purposes, for a participant who joins the conference late, or the like. The conference history information may associate timestamps or other time indicators with information from the voice conference, including speaker identifiers, transcriptions of speaker utterances, indications of discussion topics, mentioned information items, or the like.
At block 23.104, the process performs presenting at least some of the conference history information to a user. Presenting the conference history information may include playing back audio, displaying a transcript, presenting indications of topics of conversation, or the like. In some embodiments, the conference history information may be presented on a display of a conferencing device (if it has one) or on some other display, such as a laptop or desktop display that is proximately located to the user. The conference history information may be presented in an audible and/or visible manner.
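A compact sketch of blocks 23.101 through 23.104 follows. The `identify_speaker`, `transcribe`, and `present` callables are hypothetical hooks standing in for the components described elsewhere herein, and the stand-in values in the usage example are invented.

```python
def handle_voice_conference(audio_chunks, identify_speaker, transcribe, present):
    """Receive speech data, determine speaker-related information, record
    conference history, and present it to the user (blocks 23.101-23.104)."""
    history = []  # conference history: (chunk index, speaker info, transcription)
    for index, chunk in enumerate(audio_chunks):
        speaker_info = identify_speaker(chunk)   # block 23.102: speaker-related info
        text = transcribe(chunk)                 # transcription used by the history
        entry = (index, speaker_info, text)
        history.append(entry)                    # block 23.103: record history
        present(entry)                           # block 23.104: present to the user
    return history


# Usage with trivial stand-ins:
chunks = [b"chunk-1", b"chunk-2"]
handle_voice_conference(
    chunks,
    identify_speaker=lambda c: {"name": "Bill"},
    transcribe=lambda c: "the deadline is tomorrow",
    present=lambda entry: print(entry),
)
```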
At block 23.201, the process performs recording a transcription of utterances made by speakers during the voice conference. If the process performs speech recognition as discussed herein, it may record the results of such speech recognition as a transcription of the voice conference.
At block 23.301, the process performs performing speech recognition to convert data representing a speech signal from one of the multiple speakers into text. In some embodiments, the process performs automatic speech recognition to convert audio data into text. Various approaches may be employed, including using hidden Markov models (“HMM”), neural networks, or the like. The data representing the speech signal may be frequency coefficients, such as mel-frequency coefficients or a similar representation adapted for automatic speech recognition.
At block 23.302, the process performs storing the text in association with an indicator of the one speaker. The text may be stored in a data store (e.g., disk, database, file) of the AEFS, a conferencing device, or some other system, such as a cloud-based storage system.
At block 23.401, the process performs recording indications of topics discussed during the voice conference. Topics of conversation may be identified in various ways. For example, the process may track entities or terms that are commonly mentioned during the course of the voice conference. Various text processing techniques or metrics may be applied to identify key terms or concepts, such as term frequencies, inverse document frequencies, and the like. As another example, the process may attempt to identify agenda items which are typically discussed early in the voice conference. The process may also or instead refer to messages or other information items that are related to the voice conference, such as by analyzing email headers (e.g., subject lines) of email messages sent between participants in the voice conference.
At block 23.501, the process performs performing speech recognition to convert the data representing speech signals into text. As noted, some embodiments perform speech recognition to convert audio data into text data.
At block 23.502, the process performs analyzing the text to identify frequently used terms or phrases. In some embodiments, the process maintains a term vector or other structure with respect to a transcript (or window or portion thereof) of the voice conference. The term vector may associate terms with information about corresponding frequency, such as term counts, term frequency, document frequency, inverse document frequency, or the like. The text may be processed in other ways as well, such as by stemming, stop word filtering, or the like.
At block 23.503, the process performs determining the topics discussed during the voice conference based on the frequently used terms or phrases. Terms having a high information retrieval metric value, such as term frequency or TF-IDF (term frequency-inverse document frequency), may be identified as topics of conversation. Other information processing techniques may be employed instead or in addition, such as Bayesian classification, decision trees, or the like.
At block 23.601, the process performs recording indications of information items related to subject matter of the voice conference. The process may track information items that are mentioned during the voice conference or otherwise related to participants in the voice conference, such as emails sent between participants in the voice conference.
At block 23.701, the process performs performing speech recognition to convert the data representing speech signals into text. As noted, some embodiments perform speech recognition to convert audio data into text data.
At block 23.702, the process performs analyzing the text to identify information items mentioned by the speakers. The process may use terms from the text to perform searches against a document store, email database, search index, or the like, in order to locate information items (e.g., messages, documents) that include one or more of those text terms as content or metadata (e.g., author, title, date). The process may also or instead attempt to identify information about information items, such as author, date, or title, based on the text. For example, from the text “I sent an email to John last week” the process may determine that an email message was sent to a user named John during the last week, and then use that information to narrow a search for such an email message.
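A simple keyword-overlap search of the kind described above is sketched below. The in-memory item list, the `find_mentioned_items` helper, and the list of ignored common words are illustrative; a deployed system would instead query a document store, mail index, or search service, and would exploit metadata such as author and date to narrow the search.

```python
COMMON_WORDS = {"the", "a", "an", "to", "i", "sent", "last", "week"}


def find_mentioned_items(utterance_text, items):
    """Return titles of information items that share content terms with an utterance.

    `items` is a list of dicts with 'title' and 'content' fields."""
    terms = set(utterance_text.lower().split()) - COMMON_WORDS
    matches = []
    for item in items:
        haystack = set((item["title"] + " " + item["content"]).lower().split())
        if terms & haystack:
            matches.append(item["title"])
    return matches


items = [
    {"title": "Email to John", "content": "john please review the schedule"},
    {"title": "Travel policy", "content": "reimbursement rules for travel"},
]
print(find_mentioned_items("I sent an email to John last week", items))
# -> ['Email to John']
```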
At block 23.801, the process performs recording the data representing speech signals from the voice conference. The process may record speech, and then use such recordings for later playback, as a source for transcription, or for other purposes. The data may be recorded in various ways and/or formats, including in compressed formats.
At block 23.901, the process performs as each of the multiple speakers takes a turn speaking during the voice conference, recording speaker-related information associated with the speaker. The process may, in substantially real time, record speaker-related information associated with a current speaker, such as a name of the speaker, a message sent by the speaker, a document drafted by the speaker, or the like.
At block 23.1001, the process performs recording conference history information based on the speaker-related information during a telephone conference call amongst the multiple speakers. In some embodiments, the process operates to record information about a telephone conference, even when some or all of the speakers are using POTS (plain old telephone service) telephones.
At block 23.1101, the process performs presenting the conference history information to a new participant in the voice conference, the new participant having joined the voice conference while the voice conference was already in progress. In some embodiments, the process may play back history information to a late arrival to the voice conference, so that the new participant may catch up with the conversation without needing to interrupt the proceedings.
At block 23.1201, the process performs presenting the conference history information to a participant in the voice conference, the participant having rejoined the voice conference after having not participated in the voice conference for a period of time. In some embodiments, the process may play back history information to a participant who leaves and then rejoins the conference, for example when a participant temporarily leaves to visit the restroom, obtain some food, or attend to some other matter.
At block 23.1401, the process performs presenting the conference history information to a user after conclusion of the voice conference. The process may record the conference history information such that it can be presented at a later date, such as for reference purposes, for legal analysis (e.g., as a deposition), or the like.
At block 23.1501, the process performs providing a user interface configured to access the conference history information by scrolling through a temporal record of the voice conference. As discussed with reference to
At block 23.1601, the process performs presenting a transcription of utterances made by speakers during the voice conference. The process may present text of what was said (and by whom) during the voice conference. The process may also mark or associate utterances with timestamps or other time indicators.
At block 23.1701, the process performs presenting indications of topics discussed during the voice conference. The process may present indications of topics discussed, such as may be determined based on terms used by speakers during the conference, as discussed above.
At block 23.1801, the process performs presenting indications of information items related to subject matter of the voice conference. The process may present relevant information items, such as emails, documents, plans, agreements, or the like mentioned or referenced by one or more speakers. In some embodiments, the information items may be related to the content of the discussion, such as because they include common key terms, even if the information items have not been directly referenced by any speaker.
At block 23.1901, the process performs presenting, while a current speaker is speaking, conference history information on a display device of the user, the displayed conference history information providing information related to previous statements made by the current speaker. For example, as the user engages in a conference call from his office, the process may present information related to statements made at an earlier time during the current voice conference or some previous voice conference.
At block 23.2001, the process performs performing voice identification based on the data representing the speech signals from the voice conference. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints.
At block 23.2101, the process performs in a conference call system, matching a portion of the data representing the speech signals with an identity of one of the multiple speakers, based on a communication channel that is associated with the one speaker and over which the portion of the data is transmitted. In some embodiments, a conference call system includes or accesses multiple distinct communication channels (e.g., phone lines, sockets, pipes) that each transmit data from one of the multiple speakers. In such a situation, the conference call system can match the identity of a speaker with audio data transmitted over that speaker's communication channel.
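The channel-based attribution can be sketched with a small registry mapping channel identifiers to participant identities; the registry and helper functions below are illustrative rather than part of any particular conference call system.

```python
# channel id -> participant identity, populated as participants join.
channel_registry = {}


def register_participant(channel_id, identity):
    """Record which participant is connected over a given channel."""
    channel_registry[channel_id] = identity


def attribute_audio(channel_id, audio_chunk):
    """Return (speaker identity, audio) for data received on a known channel."""
    return channel_registry.get(channel_id, "unknown participant"), audio_chunk


register_participant("line-3", "Bill")
print(attribute_audio("line-3", b"<pcm samples>")[0])  # -> Bill
```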
At block 23.2201, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple persons. In some embodiments, the process accesses voice prints associated with multiple persons, and determines a best match against the speech signal.
At block 23.2301, the process performs processing voice messages from the multiple persons to generate voice print data for each of the multiple persons. Given a telephone voice message, the process may associate generated voice print data for the voice message with one or more (direct or indirect) identifiers corresponding with the message. For example, the message may have a sender telephone number associated with it, and the process can use that sender telephone number to do a reverse directory lookup (e.g., in a public directory, in a personal contact list) to determine the name of the voice message speaker.
At block 23.2401, the process performs processing telephone voice messages stored by a voice mail service. In some embodiments, the process analyzes voice messages to generate voice prints/models for multiple persons.
At block 23.2501, the process performs performing speech recognition to convert the data representing speech signals into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by a speaker. Speech recognition may be performed by way of hidden Markov model-based systems, neural networks, stochastic modeling, or the like. In some embodiments, the speech recognition may be based on cepstral coefficients that represent the speech signal.
At block 23.2601, the process performs finding an information item that references the one speaker and/or that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item (e.g., email, text message, status update) that includes words spoken by one speaker. Then, the process can infer that the one speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
At block 23.2701, the process performs retrieving information items that reference the text data. The process may here retrieve or otherwise obtain documents, calendar events, messages, or the like, that include, contain, or otherwise reference some portion of the text data.
At block 23.2702, the process performs informing the user of the retrieved information items. The information item itself, or an indication thereof (e.g., a title, a link), may be displayed.
At block 23.2801, the process performs performing speech recognition based at least in part on a language model associated with the one speaker. A language model may be used to improve or enhance speech recognition. For example, the language model may represent word transition likelihoods (e.g., by way of n-grams) that can be advantageously employed to enhance speech recognition. Furthermore, such a language model may be speaker specific, in that it may be based on communications or other information generated by the one speaker.
At block 23.2901, the process performs generating the language model based on information items generated by the one speaker, the information items including at least one of emails transmitted by the one speaker, documents authored by the one speaker, and/or social network messages transmitted by the one speaker. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like to generate a language model that is specific or otherwise tailored to the one speaker.
At block 23.3001, the process performs generating the language model based on information items generated by or referencing any of the multiple speakers, the information items including emails, documents, and/or social network messages. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, and the like generated by or referencing any of the multiple speakers to generate a language model that is tailored to the current conversation.
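Building such a corpus-derived model may be as simple as counting unigrams and bigrams over the speakers' messages and documents, as sketched below. The tiny example corpus is invented, and the resulting counts are of the kind used in the rescoring sketch given earlier.

```python
from collections import Counter


def build_bigram_model(texts):
    """Count unigrams and bigrams over a corpus of a speaker's emails, documents,
    and social network messages, producing counts a speech recognizer can use
    for speaker-adapted language modeling."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


speaker_corpus = [
    "the deadline for the project is next week",
    "please review the project plan before the deadline",
]
unigrams, bigrams = build_bigram_model(speaker_corpus)
print(bigrams[("the", "deadline")], bigrams[("next", "week")])  # -> 2 1
```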
At block 23.3101, the process performs determining which one of the multiple speakers is speaking during a time interval. The process may determine which one of the speakers is currently speaking, even if the identity of the current speaker is not known. Various approaches may be employed, including detecting the source of a speech signal, performing voice identification, or the like.
At block 23.3201, the process performs associating a first portion of the received data with a first one of the multiple speakers. The process may correspond, bind, link, or otherwise associate a portion of the received data with a speaker. Such an association may then be used for further processing, such as voice identification, speech recognition, or the like.
At block 23.3301, the process performs receiving the first portion of the received data along with an identifier associated with the first speaker. In some embodiments, the process may receive data along with an identifier, such as an IP address (e.g., in a voice over IP conferencing system). Some conferencing systems may provide an identifier (e.g., telephone number) of a current speaker by detecting which telephone line or other circuit (virtual or physical) has an active signal.
At block 23.3401, the process performs selecting the first portion based on the first portion representing only speech from the one speaker and no other of the multiple speakers. The process may select a portion of the received data based on whether or not the received data includes speech from only one, or more than one speaker (e.g., when multiple speakers are talking over each other).
At block 23.3501, the process performs determining that two or more of the multiple speakers are speaking concurrently. The process may determine that multiple speakers are talking at the same time, and take action accordingly. For example, the process may elect not to attempt to identify any speaker, or instead identify all of the speakers who are talking out of turn.
At block 23.3601, the process performs performing voice identification to select which one of multiple previously analyzed voices is a best match for the one speaker who is speaking during the time interval. As noted above, voice identification may be employed to determine the current speaker.
At block 23.3701, the process performs performing speech recognition to convert the received data into text data. For example, the process may convert the received data into a sequence of words that are (or are likely to be) the words uttered by a speaker. Speech recognition may be performed by way of hidden Markov model-based systems, neural networks, stochastic modeling, or the like. In some embodiments, the speech recognition may be based on cepstral coefficients that represent the speech signal.
At block 23.3702, the process performs identifying one of the multiple speakers based on the text data. Given text data (e.g., words spoken by a speaker), the process may search for information items that include the text data, and then identify the one speaker based on those information items.
At block 23.3801, the process performs finding an information item that references the one speaker and that includes one or more words in the text data. In some embodiments, the process may search for and find a document or other item (e.g., email, text message, status update) that includes words spoken by one speaker. Then, the process can infer that the one speaker is the author of the document, a recipient of the document, a person described in the document, or the like.
At block 23.3901, the process performs developing a corpus of speaker data by recording speech from multiple persons. Over time, the process may gather and record speech obtained during its operation and/or from the operation of other systems (e.g., voice mail systems, chat systems).
At block 23.3902, the process performs determining the speaker-related information based at least in part on the corpus of speaker data. The process may use the speaker data in the corpus to improve its performance by utilizing actual, environmental speech data, possibly along with feedback received from the user, as discussed below.
At block 23.4001, the process performs generating a speech model associated with each of the multiple persons, based on the recorded speech. The generated speech model may include voice print data that can be used for speaker identification, a language model that may be used for speech recognition purposes, and/or a noise model that may be used to improve operation in speaker-specific noisy environments.
At block 23.4101, the process performs receiving feedback regarding accuracy of the conference history information. During or after providing conference history information to the user, the user may provide feedback regarding its accuracy. This feedback may then be used to train a speech processor (e.g., a speaker identification module, a speech recognition module).
At block 23.4102, the process performs training a speech processor based at least in part on the received feedback.
At block 23.4201, the process performs receiving context information related to the user and/or one of the multiple speakers. Context information may generally include information about the setting, location, occupation, communication, workflow, or other event or factor that is present at, about, or with respect to the user and/or one or more of the speakers.
At block 23.4202, the process performs determining speaker-related information associated with the multiple speakers, based on the context information. Context information may be used to determine speaker-related information, such as by determining or narrowing a set of potential speakers based on the current location of a user and/or a speaker.
At block 23.4301, the process performs receiving an indication of a location of the user or the one speaker.
At block 23.4302, the process performs determining a plurality of persons with whom the user or the one speaker commonly interacts at the location. For example, if the indicated location is a workplace, the process may generate a list of co-workers, thereby reducing or simplifying the problem of speaker identification.
At block 23.4401, the process performs receiving at least one of a GPS location from a mobile device of the user or the one speaker, a network identifier that is associated with the location, an indication that the user or the one speaker is at a workplace, an indication that the user or the one speaker is at a residence, an information item that references the user or the one speaker, and/or an information item that references the location of the user or the one speaker. A network identifier may be, for example, a service set identifier (“SSID”) of a wireless network with which the user is currently associated. In some embodiments, the process may translate a coordinate-based location (e.g., GPS coordinates) to a particular location (e.g., residence or workplace) by performing a map lookup.
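A location-based narrowing step might be sketched as follows, assuming a simple coordinate-to-place lookup and a table of people commonly encountered at each place. The coordinates, tolerance, and data structures are illustrative assumptions standing in for a real map lookup service and interaction history.

```python
def narrow_candidates_by_location(gps_coordinates, known_places, interactions):
    """Map GPS coordinates to a named place and return the people the user
    commonly interacts with there, shrinking the speaker-identification search.

    known_places: {place name: (lat, lon)} -- stand-in for a map lookup service.
    interactions: {place name: [person, ...]} -- people frequently encountered there.
    """
    def close(a, b, tolerance=0.01):
        return abs(a[0] - b[0]) < tolerance and abs(a[1] - b[1]) < tolerance

    for place, coords in known_places.items():
        if close(gps_coordinates, coords):
            return place, interactions.get(place, [])
    return None, []


places = {"workplace": (47.6205, -122.3493), "residence": (47.6101, -122.2015)}
contacts = {"workplace": ["Bill", "Bob", "Joe"], "residence": ["Ann"]}
print(narrow_candidates_by_location((47.6204, -122.3490), places, contacts))
# -> ('workplace', ['Bill', 'Bob', 'Joe'])
```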
At block 23.4501, the process performs presenting the conference history information on a display of a conferencing device of the user. In some embodiments, the conferencing device may include a display. For example, where the conferencing device is a smart phone or laptop computer, the conferencing device may include a display that provides a suitable medium for presenting the name or other identifier of the speaker.
At block 23.4601, the process performs presenting the conference history information on a display of a computing device that is distinct from a conferencing device of the user. In some embodiments, the conferencing device may not itself include any display or a display suitable for presenting conference history information. For example, where the conferencing device is an office phone, the process may elect to present the speaker-related information on a display of a nearby computing device, such as a desktop or laptop computer in the vicinity of the phone.
At block 23.4701, the process performs determining a display to serve as a presentation device for the conference history information. In some embodiments, there may be multiple displays available as possible destinations for the conference history information. For example, in an office setting, where the conferencing device is an office phone, the office phone may include a small LCD display suitable for displaying a few characters or at most a few lines of text. However, there will typically be additional devices in the vicinity of the conferencing device, such as a desktop/laptop computer, a smart phone, a PDA, or the like. The process may determine to use one or more of these other display devices, possibly based on the type of the conference history information being displayed.
At block 23.4801, the process performs selecting one display from multiple displays, based on at least one of: whether each of the multiple displays is capable of displaying all of the conference history information, the size of each of the multiple displays, and/or whether each of the multiple displays is suitable for displaying the conference history information. In some embodiments, the process determines whether all of the conference history information can be displayed on a given display. For example, where the display is a small alphanumeric display on an office phone, the process may determine that the display is not capable of displaying a large amount of conference history information. In some embodiments, the process considers the size (e.g., the number of characters or pixels that can be displayed) of each display. In some embodiments, the process considers the type of the conference history information. For example, whereas a small alphanumeric display on an office phone may be suitable for displaying the name of the speaker, it would not be suitable for displaying an email message sent by the speaker.
At block 23.4901, the process performs audibly notifying the user to view the conference history information on a display device. In some embodiments, notifying the user may include playing a tone, such as a beep, chime, or other type of notification. In some embodiments, notifying the user may include playing synthesized speech telling the user to view the display device. For example, the process may perform text-to-speech processing to generate audio of a textual message or notification, and this audio may then be played or otherwise output to the user via the conferencing device. In some embodiments, notifying the user may include telling the user that a document, calendar event, communication, or the like is available for viewing on the display device. Telling the user about a document or other speaker-related information may include playing synthesized speech that includes an utterance to that effect. In some embodiments, the process may notify the user in a manner that is not audible to at least some of the multiple speakers. For example, a tone or verbal message may be output via an earpiece speaker, such that other parties to the conversation do not hear the notification. As another example, a tone or other notification may be played into the earpiece of a telephone, such as when the process is performing its functions within the context of a telephonic conference call.
At block 23.5001, the process performs informing the user of an identifier of each of the multiple speakers. In some embodiments, the identifier of each of the speakers may be or include a given name, surname (e.g., last name, family name), nickname, title, job description, or other type of identifier of or associated with the speaker.
At block 23.5101, the process performs informing the user of information aside from identifying information related to the multiple speakers. In some embodiments, information aside from identifying information may include information that is not a name or other identifier (e.g., job title) associated with the speaker. For example, the process may tell the user about an event or communication associated with or related to the speaker.
At block 23.5201, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in
At block 23.5301, the process performs informing the user of an organization to which each of the multiple speakers belongs. In some embodiments, informing the user of an organization may include notifying the user of a business, group, school, club, team, company, or other formal or informal organization with which a speaker is affiliated. Companies may include profit or non-profit entities, regardless of organizational structure (e.g., corporations, partnerships, sole proprietorships).
At block 23.5401, the process performs informing the user of a previously transmitted communication referencing one of the multiple speakers. Various forms of communication are contemplated, including textual (e.g., emails, text messages, chats), audio (e.g., voice messages), video, or the like. In some embodiments, a communication can include content in multiple forms, such as text and audio, such as when an email includes a voice attachment.
At block 23.5501, the process performs informing the user of at least one of: an email transmitted between the one speaker and the user and/or a text message transmitted between the one speaker and the user. An email transmitted between the one speaker and the user may include an email sent from the one speaker to the user, or vice versa. Text messages may include short messages according to various protocols, including SMS, MMS, and the like.
At block 23.5601, the process performs informing the user of an event involving the user and one of the multiple speakers. An event may be any occurrence that involves or involved the user and a speaker, such as a meeting (e.g., social or professional meeting or gathering) attended by the user and the speaker, an upcoming deadline (e.g., for a project), or the like.
At block 23.5701, the process performs informing the user of a previously occurring event and/or a future event that is at least one of a project, a meeting, and/or a deadline.
At block 23.5801, the process performs accessing information items associated with one of the multiple speakers. In some embodiments, accessing information items associated with one of the multiple speakers may include retrieving files, documents, data records, or the like from various sources, such as local or remote storage devices, cloud-based servers, and the like. In some embodiments, accessing information items may also or instead include scanning, searching, indexing, or otherwise processing information items to find ones that include, name, mention, or otherwise reference a speaker.
At block 23.5901, the process performs searching for information items that reference the one speaker, the information items including at least one of a document, an email, and/or a text message. In some embodiments, searching may include formulating a search query to provide to a document management system or any other data/document store that provides a search interface. In some embodiments, emails or text messages that reference the one speaker may include messages sent from the one speaker, messages sent to the one speaker, messages that name or otherwise identify the one speaker in the body of the message, or the like.
At block 23.6001, the process performs accessing a social networking service to find messages or status updates that reference the one speaker. In some embodiments, accessing a social networking service may include searching for postings, status updates, personal messages, or the like that have been posted by, posted to, or otherwise reference the one speaker. Example social networking services include Facebook, Twitter, Google Plus, and the like. Access to a social networking service may be obtained via an API or similar interface that provides access to social networking data related to the user and/or the one speaker.
At block 23.6101, the process performs accessing a calendar to find information about appointments with the one speaker. In some embodiments, accessing a calendar may include searching a private or shared calendar to locate a meeting or other appointment with the one speaker, and providing such information to the user via the conferencing device.
At block 23.6201, the process performs accessing a document store to find documents that reference the one speaker. In some embodiments, documents that reference the one speaker include those that are authored at least in part by the one speaker, those that name or otherwise identify the speaker in a document body, or the like. Accessing the document store may include accessing a local or remote storage device/system, accessing a document management system, accessing a source control system, or the like.
At block 23.6301, the process performs receiving audio data from at least one of a telephone, a conference call, an online audio chat, a video conference, and/or a face-to-face conference that includes the multiple speakers, the received audio data representing utterances made by at least one of the multiple speakers. In some embodiments, the process may function in the context of a telephone conference, such as by receiving audio data from a system that facilitates the telephone conference, including a physical or virtual PBX (private branch exchange), a voice over IP conference system, or the like. The process may also or instead function in the context of an online audio chat, a video conference, or a face-to-face conversation.
At block 23.6401, the process performs receiving data representing speech signals from a voice conference amongst multiple speakers, wherein the multiple speakers are remotely located from one another. In some embodiments, the multiple speakers are remotely located from one another. Two speakers may be remotely located from one another even though they are in the same building or at the same site (e.g., campus, cluster of buildings), such as when the speakers are in different rooms, cubicles, or other locations within the site or building. In other cases, two speakers may be remotely located from one another by being in different cities, states, regions, or the like.
At block 23.6501, the process performs transmitting the conference history information from a first device to a second device having a display. In some embodiments, at least some of the processing may be performed on distinct devices, resulting in a transmission of conference history information from one device to another device, for example from a desktop computer or a cloud-based server to a conferencing device.
At block 23.6601, the process performs wirelessly transmitting the conference history information. Various protocols may be used, including Bluetooth, infrared, WiFi, or the like.
At block 23.6701, the process performs transmitting the conference history information from a smart phone to the second device. For example a smart phone may forward the conference history information to a desktop computing system for display on an associated monitor.
At block 23.6801, the process performs transmitting the conference history information from a server system to the second device. In some embodiments, some portion of the processing is performed on a server system that may be remote from the conferencing device.
At block 23.6901, the process performs transmitting the conference history information from a server system that resides in a data center.
At block 23.7001, the process performs transmitting the conference history information from a server system to a desktop computer, a laptop computer, a mobile device, or a desktop telephone of the user.
At block 23.7101, the process performs performing the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information associated with the multiple speakers, the recording conference history information based on the speaker-related information, and/or the presenting at least some of the conference history information on a mobile device that is operated by the user. As noted, in some embodiments a computer or mobile device such as a smart phone may have sufficient processing power to perform a portion of the process, such as identifying a speaker, determining the conference history information, or the like.
At block 23.7201, the process performs determining speaker-related information associated with the multiple speakers, performed on a smart phone or a media player that is operated by the user.
At block 23.7301, the process performs performing the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information associated with the multiple speakers, the recording conference history information based on the speaker-related information, and/or the presenting at least some of the conference history information on a general purpose computing device that is operated by the user. For example, in an office setting, a general purpose computing device (e.g., the user's desktop computer, laptop computer) may be configured to perform some or all of the process.
At block 23.7401, the process performs performing one or more of the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information associated with the multiple speakers, the recording conference history information based on the speaker-related information, and/or the presenting at least some of the conference history information on each of multiple computing systems, wherein each of the multiple systems is associated with one of the multiple speakers. In some embodiments, each of the multiple speakers has his own computing system that performs one or more operations of the method.
At block 23.7501, the process performs performing one or more of the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information associated with the multiple speakers, the recording conference history information based on the speaker-related information, and/or the presenting at least some of the conference history information within a conference call provider system. In some embodiments, a conference call provider system performs one or more of the operations of the method. For example, an Internet-based conference call system may receive audio data from participants in a voice conference, and perform various processing tasks, including speech recognition, recording conference history information, and the like.
At block 23.7601, the process performs determining to perform at least some of the receiving data representing speech signals from a voice conference amongst multiple speakers, the determining speaker-related information associated with the multiple speakers, the recording conference history information based on the speaker-related information, and/or the presenting at least some of the conference history information on another computing device that has available processing capacity. In some embodiments, the process may determine to offload some of its processing to another computing device or system.
At block 23.7701, the process performs receiving at least some of the speaker-related information or the conference history information from the other computing device. The process may receive the speaker-related information or the conference history information, or a portion thereof, from the other computing device.
At block 23.7801, the process performs selecting a portion of the conference history information based on capabilities of a device operated by the user. In some embodiments, the process selects a portion of the recorded conference history information based on device capabilities, such as processing power, memory, display capabilities, or the like.
At block 23.7802, the process performs transmitting the selected portion for presentation on the device operated by the user. The process may then transmit just the selected portion to the device. For example, if a user is using a mobile phone having limited memory, the process may elect not to transmit previously recorded audio to the mobile phone and instead only transmit the text transcription of the voice conference. As another example, if the mobile phone has a limited display, the process may only send information items that can be readily presented on the display.
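A minimal sketch of the selection and transmission decision described above might look like the following; the capability fields and thresholds are illustrative assumptions rather than properties of any particular device.

```python
# Minimal sketch: choose which parts of the conference history to transmit
# based on (assumed) device capability fields.

def select_history_portion(history, device):
    """history: dict with 'transcript', 'audio', 'info_items' keys (assumed).
    device: dict describing the target device's capabilities (assumed)."""
    portion = {"transcript": history["transcript"]}   # text is cheapest, always sent
    if device.get("free_memory_mb", 0) >= 200:        # illustrative threshold
        portion["audio"] = history["audio"]           # raw audio only if memory allows
    if device.get("display_lines", 0) >= 10:          # larger displays get info items
        portion["info_items"] = history["info_items"]
    return portion
```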
At block 23.7901, the process performs performing speech recognition to convert an utterance of one of the multiple speakers into text, the speech recognition performed at a mobile device of the one speaker. In some embodiments, a mobile device (e.g., a cell phone, smart phone) of a speaker may perform speech recognition on the speaker's utterances. As discussed below, the results of the speech recognition may then be transmitted to some remote system or device.
At block 23.7902, the process performs transmitting the text along with an audio representation of the utterance and an identifier of the speaker to a remote conferencing device and/or a conference call system. After having performed the speech recognition, the mobile device may transmit the obtained text along with an identifier of the speaker and the audio representation of the speaker's utterance to a remote system or device. In this manner, the speech recognition load may be distributed among multiple distributed communication devices used by the speakers in the voice conference.
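The payload such a mobile device might transmit could be packaged as sketched below; the field names and the JSON encoding are assumptions for illustration, not a prescribed wire format.

```python
# Minimal sketch: package recognized text, the audio it was derived from, and a
# speaker identifier for transmission to a remote conferencing system.
# The endpoint URL and field names are hypothetical.
import base64
import json

def build_utterance_payload(speaker_id, text, audio_bytes):
    return json.dumps({
        "speaker_id": speaker_id,
        "text": text,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    })

# e.g., requests.post("https://conference.example.com/utterances",
#                     data=build_utterance_payload("spk-17", "hello all", pcm_bytes),
#                     headers={"Content-Type": "application/json"})
```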
At block 23.8001, the process performs translating an utterance of one of the multiple speakers in a first language into a message in a second language, based on the speaker-related information. In some embodiments, the process may also perform language translation, such that a voice conference may be held between speakers of different languages. In some embodiments, the utterance may be translated by first performing speech recognition on the data representing the speech signal to convert the utterance into textual form. Then, the text of the utterance may be translated into the second language using natural language processing and/or machine translation techniques. The speaker-related information may be used to improve, enhance, or otherwise modify the process of machine translation. For example, based on the identity of the one speaker, the process may use a language or speech model that is tailored to the one speaker in order to improve a machine translation process. As another example, the process may use one or more information items that reference the one speaker to improve machine translation, such as by disambiguating references in the utterance of the one speaker.
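A minimal sketch of this recognize-then-translate flow appears below; `recognize` and `translate` stand in for whatever speech-recognition and machine-translation components an embodiment actually uses, and the speaker glossary used for disambiguation is likewise an illustrative assumption.

```python
# Minimal sketch of the recognize-then-translate pipeline described above.
# `recognize` and `translate` are assumed callables supplied by the embodiment;
# the speaker glossary used to disambiguate terms is an illustrative assumption.

def translate_utterance(audio, speaker_info, recognize, translate,
                        src_lang="de", dst_lang="en"):
    text = recognize(audio, lang=src_lang,
                     vocabulary=speaker_info.get("frequent_terms", []))
    # Resolve ambiguous references using speaker-related information before MT.
    for short_form, long_form in speaker_info.get("glossary", {}).items():
        text = text.replace(short_form, long_form)
    return translate(text, src=src_lang, dst=dst_lang)
```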
At block 23.8002, the process performs recording the message in the second language as part of the conference history information. The message may be recorded as part of the conference history information for later presentation. The conference history information may of course be presented in various ways including using audible output (e.g., via text-to-speech processing of the message) and/or using visible output of the message (e.g., via a display screen of the conferencing device or some other device that is accessible to the user).
At block 23.8101, the process performs determining the first language. In some embodiments, the process may determine or identify the first language, possibly prior to performing language translation. For example, the process may determine that the one speaker is speaking in German, so that it can configure a speech recognizer to recognize German language utterances. In some embodiments, determining the first language may include concurrently processing the received data with multiple speech recognizers that are each configured to recognize speech in a different corresponding language (e.g., German, French, Spanish). Then, the process may select as the first language the language corresponding to the speech recognizer, of the multiple speech recognizers, that produces a result having a higher confidence level than the others. In some embodiments, determining the language may be based on one or more of signal characteristics that are correlated with the first language, the location of the user or the speaker, user inputs, or the like.
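One way to realize the concurrent-recognizer approach is sketched below; each recognizer is an assumed callable that returns a (text, confidence) pair for its language.

```python
# Minimal sketch: run one recognizer per candidate language concurrently and
# pick the language whose recognizer reports the highest confidence.
# Each recognizer is an assumed callable returning (text, confidence).
from concurrent.futures import ThreadPoolExecutor

def identify_language(audio, recognizers):
    """recognizers: mapping like {"de": recognize_de, "fr": recognize_fr, ...}"""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = {lang: pool.submit(fn, audio) for lang, fn in recognizers.items()}
        results = {lang: f.result() for lang, f in futures.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    return best_lang, results[best_lang]
```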
At block 23.8201, the process performs performing speech recognition, based on the speaker-related information, on the data representing the speech signal to convert the utterance in the first language into text representing the utterance in the first language. The speech recognition process may be improved, augmented, or otherwise adapted based on the speaker-related information. In one example, information about vocabulary frequently used by the one speaker may be used to improve the performance of a speech recognizer.
At block 23.8202, the process performs translating, based on the speaker-related information, the text representing the utterance in the first language into text representing the message in the second language. Translating from a first to a second language may also be improved, augmented, or otherwise adapted based on the speaker-related information. For example, when such a translation includes natural language processing to determine syntactic or semantic information about an utterance, such natural language processing may be improved with information about the one speaker, such as idioms, expressions, or other language constructs frequently employed or otherwise correlated with the one speaker.
At block 23.8301, the process performs performing speech synthesis to convert the text representing the utterance in the second language into audio data representing the message in the second language.
At block 23.8302, the process performs causing the audio data representing the message in the second language to be played to the user. The message may be played, for example, via an audio speaker of the conferencing device.
At block 23.8401, the process performs translating the utterance based on speaker-related information including a language model that is adapted to the one speaker. A speaker-adapted language model may include or otherwise identify frequent words or patterns of words (e.g., n-grams) based on prior communications or other information about the one speaker. Such a language model may be based on communications or other information generated by or about the one speaker. Such a language model may be employed in the course of speech recognition, natural language processing, machine translation, or the like. Note that the language model need not be unique to the one speaker, but may instead be specific to a class, type, or group of speakers that includes the one speaker. For example, the language model may be tailored for speakers in a particular industry, from a particular region, or the like.
At block 23.8501, the process performs translating the utterance based on speaker-related information including a language model adapted to the voice conference. A language model adapted to the voice conference may include or otherwise identify frequent words or patterns of words (e.g., n-grams) based on prior communications or other information about any one or more of the speakers in the voice conference. Such a language model may be based on communications or other information generated by or about the speakers in the voice conference. Such a language model may be employed in the course of speech recognition, natural language processing, machine translation, or the like.
At block 23.8601, the process performs generating the language model based on information items by or about any of the multiple speakers, the information items including at least one of emails, documents, and/or social network messages. In some embodiments, the process mines or otherwise processes emails, text messages, voice messages, social network messages, and the like to generate a language model that is tailored to the voice conference.
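As a minimal illustration of tailoring a language model to the conference, the sketch below derives bigram counts from a small set of text items; a practical embodiment would normalize, smooth, and likely use a far richer model.

```python
# Minimal sketch: derive bigram counts from text items (emails, messages,
# documents) by or about the conference participants. This only illustrates
# the idea of tailoring a language model to the voice conference.
from collections import Counter

def build_bigram_model(texts):
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

model = build_bigram_model([
    "quarterly revenue forecast attached",
    "please review the revenue forecast before the call",
])
print(model.most_common(3))
```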
At block 23.8701, the process performs translating the utterance based on speaker-related information including a language model developed with respect to a corpus of related content. In some embodiments, the process may use language models developed with respect to a corpus of related content, such as may be obtained from past voice conferences, academic conferences, documentaries, or the like. For example, if the current voice conference is about a particular technical subject, the process may refer to a language model from a prior academic conference directed to the same technical subject. Such a language model may be based on an analysis of academic papers and/or transcriptions from the academic conference.
At block 23.8901, the process performs receiving digital samples of an audio wave captured by a microphone. In some embodiments, the microphone may be a microphone of a conferencing device operated by a speaker. The samples may be raw audio samples or in some compressed format.
At block 23.9001, the process performs receiving recorded voice samples from a storage device. In some embodiments, the process receives audio data from a storage device, such as a magnetic disk, a memory, or the like. The audio data may be stored or buffered on the storage device.
At block 23.9401, the process performs determining to perform one or more of archiving, indexing, searching, removing, redacting, duplicating, or deleting some of the conference history information based on a data retention policy. In some embodiments, the process may determine to perform various operations in accordance with a data retention policy. For example, an organization may elect to record conference history information for all conference calls for a specified time period. In such cases, the process may be configured to automatically delete conference history information after a specified time interval (e.g., one year, six months). As another example, the process may redact the names or other identifiers of speakers in the conference history information associated with a conference call.
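The sketch below shows how such a policy might be applied to stored conference history records; the record fields, the one-year default, and the redaction behavior are illustrative assumptions.

```python
# Minimal sketch: apply a simple retention policy to stored conference history
# records. The record fields and the policy parameters are illustrative.
from datetime import datetime, timedelta

def apply_retention_policy(records, max_age_days=365, redact_speakers=False):
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    kept = []
    for rec in records:
        if rec["recorded_at"] < cutoff:
            continue                      # delete: older than the retention window
        if redact_speakers:
            rec = dict(rec, speaker="[redacted]")
        kept.append(rec)
    return kept
```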
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 21.100. In addition, the computing system 24.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 21.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 24.400 comprises a computer memory (“memory”) 24.401, a display 24.402, one or more Central Processing Units (“CPU”) 24.403, Input/Output devices 24.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 24.405, and network connections 24.406. The AEFS 21.100 is shown residing in memory 24.401. In other embodiments, some portion of the contents and some or all of the components of the AEFS 21.100 may be stored on and/or transmitted over the other computer-readable media 24.405. The components of the AEFS 21.100 preferably execute on one or more CPUs 24.403 and facilitate ability enhancement, as described herein. Other code or programs 24.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 24.420, also reside in the memory 24.401, and preferably execute on one or more CPUs 24.403. Of note, one or more of the illustrated components may not be present in any specific embodiment.
The AEFS 21.100 interacts via the network 24.450 with conferencing devices 21.120, speaker-related information sources 21.130, and third-party systems/applications 24.455. The network 24.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 24.455 may include any systems that provide data to, or utilize data from, the AEFS 21.100, including Web browsers, e-commerce sites, calendar applications, email systems, social networking services, and the like.
The AEFS 21.100 is shown executing in the memory 24.401 of the computing system 24.400. Also included in the memory are a user interface manager 24.415 and an application program interface (“API”) 24.416. The user interface manager 24.415 and the API 24.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 21.100.
The UI manager 24.415 provides a view and a controller that facilitate user interaction with the AEFS 21.100 and its various components. For example, the UI manager 24.415 may provide interactive access to the AEFS 21.100, such that users can configure the operation of the AEFS 21.100, such as by providing the AEFS 21.100 credentials to access various sources of speaker-related information, including social networking services, email systems, document stores, or the like. In some embodiments, access to the functionality of the UI manager 24.415 may be provided via a Web server, possibly executing as one of the other programs 24.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 24.455 can interact with the AEFS 21.100 via the UI manager 24.415.
The API 24.416 provides programmatic access to one or more functions of the AEFS 21.100. For example, the API 24.416 may provide a programmatic interface to one or more functions of the AEFS 21.100 that may be invoked by one of the other programs 24.430 or some other module. In this manner, the API 24.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 21.100 into Web applications), and the like.
In addition, the API 24.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the conferencing devices 21.120, information sources 21.130, and/or one of the third-party systems/applications 24.455, to access various functions of the AEFS 21.100. For example, an information source 21.130 may push speaker-related information (e.g., emails, documents, calendar events) to the AEFS 21.100 via the API 24.416. The API 24.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 24.455 and that are configured to interact with the AEFS 21.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 21.100 are implemented using standard programming techniques. For example, the AEFS 21.100 may be implemented as a “native” executable running on the CPU 24.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 21.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 24.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 21.100, such as in the data store 24.420 (or 22.240), can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through scripting languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 24.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 21.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, for enhancing a user's ability to operate or function in a transportation-related context (e.g., as a pedestrian or vehicle operator) by performing vehicular threat detection based at least in part on analyzing image data that represents vehicles and other objects present in a roadway or other context. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”). Embodiments of the AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities (e.g., driving, riding a bike, walking/running) of a user.
In some embodiments, the AEFS is configured to identify threats (e.g., posed by vehicles to a user of a roadway, posed by a user to vehicles or other users of a roadway), and to provide information about such threats to the user so that he may take evasive action. Identifying threats may include analyzing information about a vehicle that is present in the roadway in order to determine whether the user and the vehicle may be on a collision course. The analyzed information may include or be represented by image data (e.g., pictures or video of a roadway and its surrounding environment), audio data (e.g., sounds reflected from or emitted by a vehicle), range information (e.g., provided by a sonar or infrared range sensor), conditions information (e.g., weather, temperature, time of day), or the like. The user may be a pedestrian (e.g., a walker, a jogger), an operator of a motorized (e.g., car, motorcycle, moped, scooter) or non-motorized vehicle (e.g., bicycle, pedicab, rickshaw), a vehicle passenger, or the like. In some embodiments, the vehicle may be operating autonomously. In some embodiments, the user wears a wearable device (e.g., a helmet, goggles, eyeglasses, hat) that is configured to at least present determined vehicular threat information to the user.
In some embodiments, the AEFS is configured to receive image data, at least some of which represents an image of a first vehicle. The image data may be obtained from various sources, including a camera of a wearable device of a user, a camera on a vehicle of the user, an in-situ road-side camera, a camera on some other vehicle, or the like. The image data may represent electromagnetic signals of various types or in various ranges, including visual signals (e.g., signals having a wavelength in the range of about 390-750 nm), infrared signals (e.g., signals having a wavelength in the range of about 750 nm-300 micrometers), or the like.
Then, the AEFS determines vehicular threat information based at least in part on the image data. In some embodiments, the AEFS may analyze the received image data in order to identify the first vehicle and/or to determine whether the first vehicle represents a threat to the user, such as because the first vehicle and the user may be on a collision course. The image data may be analyzed in various ways, including by identifying objects (e.g., to recognize that a vehicle or some other object is shown in the image data), determining motion-related information (e.g., position, velocity, acceleration, mass) about objects, or the like.
Next, the AEFS informs the user of the determined vehicular threat information via a wearable device of the user. Typically, the user's wearable device (e.g., a helmet) will include one or more output devices, such as audio speakers, visual display devices (e.g., warning lights, screens, heads-up displays), haptic devices, and the like. The AEFS may present the vehicular threat information via one or more of these output devices. For example, the AEFS may visually display or speak the words “Car on left.” As another example, the AEFS may visually display a leftward pointing arrow on a heads-up screen displayed on a face screen of the user's helmet. Presenting the vehicular threat information may also or instead include presenting a recommended course of action (e.g., to slow down, to speed up, to turn) to mitigate the determined vehicular threat.
The AEFS may use other or additional sources or types of information. For example, in some embodiments, the AEFS is configured to receive data representing an audio signal emitted by a first vehicle. The audio signal is typically obtained in proximity to a user, who may be a pedestrian or traveling in a vehicle as an operator or a passenger. In some embodiments, the audio signal is obtained by one or more microphones coupled to the user's vehicle and/or a wearable device of the user, such as a helmet, goggles, a hat, a media player, or the like. Then, the AEFS may determine vehicular threat information based at least in part on the data representing the audio signal. In some embodiments, the AEFS may analyze the received data in order to determine whether the first vehicle and the user are on a collision course. The audio data may be analyzed in various ways, including by performing audio analysis, frequency analysis (e.g., Doppler analysis), acoustic localization, or the like.
The AEFS may combine information of various types in order to determine vehicular threat information. For example, because image processing may be computationally expensive, rather than always processing all image data obtained from every possible source, the AEFS may use audio analysis to initially determine the approximate location of an oncoming vehicle, such as to the user's left, right, or rear. For example, having determined based on audio data that a vehicle may be approaching from the rear of the user, the AEFS may preferentially process image data from a rear-facing camera to further refine a threat analysis. As another example, the AEFS may incorporate information about the condition of a roadway (e.g., icy or wet) when determining whether a vehicle will be able to stop or maneuver in order to avoid an accident.
In this example, the moped 25.110a is driving towards the motorcycle 25.110b from a side street, at approximately a right angle with respect to the path of travel of the motorcycle 25.110b. The traffic signal 25.106 has just turned from red to green for the motorcycle 25.110b, and the user 25.104 is beginning to drive the motorcycle 25.110b into the intersection controlled by the traffic signal 25.106. The user 25.104 is assuming that the moped 25.110a will stop, because cross traffic will have a red light. However, in this example, the moped 25.110a may not stop in a timely manner, for one or more reasons, such as because the operator of the moped 25.110a has not seen the red light, because the moped 25.110a is moving at an excessive rate, because the operator of the moped 25.110a is impaired, because the surface conditions of the roadway are icy or slick, or the like. As will be discussed further below, the AEFS 25.100 will determine that the moped 25.110a and the motorcycle 25.110b are likely on a collision course, and inform the user 25.104 of this threat via the helmet 25.120a, so that the user may take evasive action to avoid a possible collision with the moped 25.110a.
The moped 25.110a emits or reflects a signal 25.101. In some embodiments, the signal 25.101 is an electromagnetic signal in the visible light spectrum that represents an image of the moped 25.110a. Other types of electromagnetic signals may be received and processed, including infrared radiation, radio waves, microwaves, or the like. Other types of signals are contemplated, including audio signals, such as an emitted engine noise, a reflected sonar signal, a vocalization (e.g., shout, scream), etc. The signal 25.101 may be received by a receiving detector/device/sensor, such as a camera or microphone (not shown) on the helmet 25.120a and/or the motorcycle 25.110b. In some embodiments, a computing and communication device within the helmet 25.120a receives and samples the signal 25.101 and transmits the samples or other representation to the AEFS 25.100. In other embodiments, other forms of data may be used to represent the signal 25.101, including frequency coefficients, compressed audio/video, or the like.
The AEFS 25.100 determines vehicular threat information by analyzing the received data that represents the signal 25.101. If the signal 25.101 is a visual signal, then the AEFS 25.100 may employ various image data processing techniques. For example, the AEFS 25.100 may perform object recognition to determine that received image data includes an image of a vehicle, such as the moped 25.110a. The AEFS 25.100 may also or instead process received image data to determine motion-related information with respect to the moped 25.110a, including position, velocity, acceleration, or the like. The AEFS 25.100 may further identify the presence of other objects, including pedestrians, animals, structures, or the like, that may pose a threat to the user 25.104 or that may themselves be threatened (e.g., by actions of the user 25.104 and/or the moped 25.110a). Image processing also may be employed to determine other information, including road conditions (e.g., wet or icy roads), visibility conditions (e.g., glare or darkness), and the like.
If the signal 25.101 is an audio signal, then the AEFS 25.100 may use one or more audio analysis techniques to determine the vehicular threat information. In one embodiment, the AEFS 25.100 performs a Doppler analysis (e.g., by determining whether the frequency of the audio signal is increasing or decreasing) to determine that the object that is emitting the audio signal is approaching (and possibly at what rate) the user 25.104. In some embodiments, the AEFS 25.100 may determine the type of vehicle (e.g., a heavy truck, a passenger vehicle, a motorcycle, a moped) by analyzing the received data to identify an audio signature that is correlated with a particular engine type or size. For example, a lower frequency engine sound may be correlated with a larger vehicle size, and a higher frequency engine sound may be correlated with a smaller vehicle size.
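If the emitted engine frequency can be assumed or estimated, a rough closing-speed estimate follows from the classical Doppler relation f_obs = f_src * c / (c - v); the sketch below applies that relation, with the speed of sound and the assumed source frequency as inputs.

```python
# Minimal sketch: estimate how fast a sound source is closing on the listener
# from the Doppler relation f_obs = f_src * c / (c - v), assuming the source's
# emitted frequency is known or estimated. A positive result means approaching.
SPEED_OF_SOUND = 343.0  # m/s at roughly 20 degrees C

def approach_speed(f_source_hz, f_observed_hz, c=SPEED_OF_SOUND):
    return c * (1.0 - f_source_hz / f_observed_hz)

# e.g., a 200 Hz engine tone observed at 208 Hz implies roughly
# 343 * (1 - 200/208), or about 13 m/s of closing speed.
print(approach_speed(200.0, 208.0))
```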
In one embodiment, where the signal 25.101 is an audio signal, the AEFS 25.100 performs acoustic source localization to determine information about the trajectory of the moped 25.110a, including one or more of position, direction of travel, speed, acceleration, or the like. Acoustic source localization may include receiving data representing the audio signal 25.101 as measured by two or more microphones. For example, the helmet 25.120a may include four microphones (e.g., front, right, rear, and left) that each receive the audio signal 25.101. These microphones may be directional, such that they can be used to provide directional information (e.g., an angle between the helmet and the audio source). Such directional information may then be used by the AEFS 25.100 to triangulate the position of the moped 25.110a. As another example, the AEFS 25.100 may measure differences between the arrival time of the audio signal 25.101 at multiple distinct microphones on the helmet 25.120a or other location. The difference in arrival time, together with information about the distance between the microphones, can be used by the AEFS 25.100 to determine distances between each of the microphones and the audio source, such as the moped 25.110a. Distances between the microphones and the audio source can then be used to determine one or more locations at which the audio source may be located.
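A minimal sketch of time-difference-of-arrival (TDOA) estimation for a single microphone pair is shown below; practical systems would use more microphones and more robust estimators (e.g., GCC-PHAT), so this is only a rough illustration of the arrival-time-difference idea.

```python
# Minimal sketch: estimate the time difference of arrival (TDOA) of a sound at
# two microphones by cross-correlation, then convert it to a bearing angle for
# a microphone pair separated by mic_spacing_m meters.
import numpy as np

def tdoa_bearing(ch_left, ch_right, sample_rate, mic_spacing_m, c=343.0):
    corr = np.correlate(ch_left, ch_right, mode="full")
    lag = np.argmax(corr) - (len(ch_right) - 1)   # lag in samples; sign gives side
    tdoa = lag / sample_rate
    # Bearing relative to the broadside of the microphone pair.
    sin_theta = np.clip(c * tdoa / mic_spacing_m, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```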
Determining vehicular threat information may also or instead include obtaining information such as the position, trajectory, and speed of the user 25.104, such as by receiving data representing such information from sensors, devices, and/or systems on board the motorcycle 25.110b and/or the helmet 25.120a. Such sources of information may include a speedometer, a geo-location system (e.g., GPS system), an accelerometer, or the like. Once the AEFS 25.100 has determined and/or obtained information such as the position, trajectory, and speed of the moped 25.110a and the user 25.104, the AEFS 25.100 may determine whether the moped 25.110a and the user 25.104 are likely to collide with one another. For example, the AEFS 25.100 may model the expected trajectories of the moped 25.110a and user 25.104 to determine whether they intersect at or about the same point in time.
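One simple way to model intersecting trajectories is a closest-point-of-approach computation under a constant-velocity assumption, as sketched below; the miss-distance and look-ahead thresholds are illustrative.

```python
# Minimal sketch: given current position and velocity estimates for the user
# and another vehicle (2-D, constant-velocity assumption), compute the time of
# closest approach and the miss distance, and flag a possible collision if the
# miss distance falls below a threshold within the look-ahead horizon.
import numpy as np

def closest_approach(p_user, v_user, p_other, v_other):
    dp = np.asarray(p_other, float) - np.asarray(p_user, float)
    dv = np.asarray(v_other, float) - np.asarray(v_user, float)
    denom = float(np.dot(dv, dv))
    t_star = 0.0 if denom == 0.0 else max(0.0, -float(np.dot(dp, dv)) / denom)
    miss = float(np.linalg.norm(dp + dv * t_star))
    return t_star, miss

def is_collision_threat(p_user, v_user, p_other, v_other,
                        miss_threshold_m=2.0, horizon_s=5.0):
    t_star, miss = closest_approach(p_user, v_user, p_other, v_other)
    return t_star <= horizon_s and miss <= miss_threshold_m
```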
The AEFS 25.100 may then present the determined vehicular threat information (e.g., that the moped 25.110a represents a hazard) to the user 25.104 via the helmet 25.120a. Presenting the vehicular threat information may include transmitting the information to the helmet 25.120a, where it is received and presented to the user. In one embodiment, the helmet 25.120a includes audio speakers that may be used to output an audio signal (e.g., an alarm or voice message) warning the user 25.104. In other embodiments, the helmet 25.120a includes a visual display, such as a heads-up display presented upon a face screen of the helmet 25.120a, which can be used to present a text message (e.g., “Look left”) or an icon (e.g., a red arrow pointing left).
The AEFS 25.100 may also use information received from in-situ sensors and/or devices. For example, the AEFS 25.100 may use information received from a camera 25.108 that is mounted on the traffic signal 25.106 that controls the illustrated intersection. The AEFS 25.100 may receive image data that represents the moped 25.110a and/or the motorcycle 25.110b. The AEFS 25.100 may perform image recognition to determine the type and/or position of a vehicle that is approaching the intersection. The AEFS 25.100 may also or instead analyze multiple images (e.g., from a video signal) to determine the velocity of a vehicle. Other types of sensors or devices installed in or about a roadway may also or instead be used, including range sensors, speed sensors (e.g., radar guns), induction coils (e.g., mounted in the roadbed), temperature sensors, weather gauges, or the like.
As noted above, the AEFS 25.100 may utilize data that represents a signal as detected by one or more detectors/sensors, such as microphones or cameras. In the example of
In an image context, the AEFS 25.100 may perform image processing on image data obtained from one or more of the camera sensors 25.124a and 25.124b. As discussed, the image data may be processed to determine the presence of the moped, its type, its motion-related information (e.g., velocity), and the like. In some embodiments, image data may be processed without making any definite identification of a vehicle. For example, the AEFS 25.100 may process image data from sensors 25.124a and 25.124b to identify the presence of motion (without necessarily identifying any objects). Based on such an analysis, the AEFS 25.100 may determine that there is something approaching from the left of the motorcycle 25.110b, but that the right of the motorcycle 25.110b is relatively clear.
Differences between data obtained from multiple sensors may be exploited in various ways. In an image context, an image signal may be perceived or captured differently by the two (camera) sensors 25.124a and 25.124b. The AEFS 25.100 may exploit or otherwise analyze such differences to determine the location and/or motion of the moped 25.110a. For example, knowing the relative position and optical qualities of the two cameras, it is possible to analyze images captured by those cameras to triangulate a position of an object (e.g., the moped 25.110a) or a distance between the motorcycle 25.110b and the object.
In an audio context, an audio signal may be perceived differently by the two sensors 25.124a and 25.124b. For example, if the signal 25.101 is stronger as measured at microphone 25.124a than at microphone 25.124b, the AEFS 25.100 may infer that the signal 25.101 is originating from the driver's left of the motorcycle 25.110b, and thus that a vehicle is approaching from that direction. As another example, as the strength of an audio signal is known to decay with distance, and assuming an initial level (e.g., based on an average signal level of a vehicle engine), the AEFS 25.100 may determine a distance (or distance interval) between one or more of the microphones and the signal source.
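Under a free-field spreading assumption (roughly 6 dB of loss per doubling of distance), an assumed reference level at a reference distance yields a coarse range estimate from a measured level, as sketched below; reflections, occlusion, and engine load make this only a rough interval in practice.

```python
# Minimal sketch: estimate range from a measured sound level, assuming
# free-field spreading and an assumed reference level at a reference distance.
def estimate_distance_m(measured_db, reference_db=90.0, reference_distance_m=1.0):
    return reference_distance_m * 10 ** ((reference_db - measured_db) / 20.0)

# e.g., a vehicle assumed to produce 90 dB at 1 m, measured at 62 dB,
# would be estimated at roughly 25 m away.
print(estimate_distance_m(62.0))
```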
The AEFS 25.100 may model vehicles and other objects, such as by representing their motion-related information, including position, speed, acceleration, mass, and other properties. Such a model may then be used to determine whether objects are likely to collide. Note that the model may be probabilistic. For example, the AEFS 25.100 may represent an object's position in space as a region that includes multiple positions that each have a corresponding likelihood that the object is at that position. As another example, the AEFS 25.100 may represent the velocity of an object as a range of likely values, a probability distribution, or the like. Various frames of reference may be employed, including a user-centric frame, an absolute frame, or the like.
The AEFS 25.100 may interact with various types of wearable devices 25.120, including a motorcycle helmet 25.120a (
In some embodiments, a wearable device may perform some or all of the functions of the AEFS 25.100, even though the AEFS 25.100 is depicted as separate in these examples. Some devices may have minimal processing power and thus perform only some of the functions. For example, the eyeglasses 25.120b may receive vehicular threat information from a remote AEFS 25.100, and display it on a heads-up display presented on the inside of the lenses of the eyeglasses 25.120b. Other wearable devices may have sufficient processing power to perform more of the functions of the AEFS 25.100. For example, the personal media device 25.120e may have considerable processing power and as such be configured to perform acoustic source localization, collision detection analysis, or other more computationally expensive functions.
Note that the wearable devices 25.120 may act in concert with one another or with other entities to perform functions of the AEFS 25.100. For example, the eyeglasses 25.120b may include a display mechanism that receives and displays vehicular threat information determined by the personal media device 25.120e. As another example, the goggles 25.120c may include a display mechanism that receives and displays vehicular threat information determined by a computing device in the helmet 25.120a or 25.120d. In a further example, one of the wearable devices 25.120 may receive and process audio data received by microphones mounted on the vehicle 25.110c.
The AEFS 25.100 may also or instead interact with vehicles 25.110 and/or computing devices installed thereon. As noted, a vehicle 25.110 may have one or more sensors or devices that may operate as (direct or indirect) sources of information for the AEFS 25.100. The vehicle 25.110c, for example, may include a speedometer, an accelerometer, one or more microphones, one or more range sensors, or the like. Data obtained by, at, or from such devices of vehicle 25.110c may be forwarded to the AEFS 25.100, possibly by a wearable device 25.120 of an operator of the vehicle 25.110c.
In some embodiments, the vehicle 25.110c may itself have or use an AEFS, and be configured to transmit warnings or other vehicular threat information to others. For example, an AEFS of the vehicle 25.110c may have determined that the moped 25.110a was driving with excessive speed just prior to the scenario depicted in
The AEFS 25.100 may also or instead interact with sensors and other devices that are installed on, in, or about roads or in other transportation related contexts, such as parking garages, racetracks, or the like. In this example, the AEFS 25.100 interacts with the camera 25.108 to obtain images of vehicles, pedestrians, or other objects present in a roadway. Other types of sensors or devices may include range sensors, infrared sensors, induction coils, radar guns, temperature gauges, precipitation gauges, or the like.
The AEFS 25.100 may further interact with information systems that are not shown in
In some embodiments, the AEFS 25.100 may transmit information to law enforcement agencies and/or related computing systems. For example, if the AEFS 25.100 determines that a vehicle is driving erratically, it may transmit that fact along with information about the vehicle (e.g., make, model, color, license plate number, location) to a police computing system.
Note that in some embodiments, at least some of the described techniques may be performed without the utilization of any wearable devices 25.120. For example, a vehicle 25.110 may itself include the necessary computation, input, and output devices to perform functions of the AEFS 25.100. For example, the AEFS 25.100 may present vehicular threat information on output devices of a vehicle 25.110, such as a radio speaker, dashboard warning light, heads-up display, or the like. As another example, a computing device on a vehicle 25.110 may itself determine the vehicular threat information.
In some embodiments, the AEFS 25.100 processes the image 25.140 to perform object identification. Upon processing the image 25.140, the AEFS 25.100 may identify the moped 25.110a, the child 25.141, the sun 25.142, and/or the puddle 25.143. A sequence of images, taken at different times (e.g., one tenth of a second apart) may be used to determine that the moped 25.110a is moving, how fast the moped 25.110a is moving, acceleration/deceleration of the moped 25.110a, or the like. Motion of other objects, such as the child 25.141 may also be tracked. Based on such motion-related information, the AEFS 25.100 may model the physics of the identified objects to determine whether a collision is likely.
Determining vehicular threat information may also or instead be based on factors related or relevant to objects other than the moped 25.110a or the user 25.104. For example, the AEFS 25.100 may determine that the puddle 25.143 will likely make it more difficult for the moped 25.110a to stop. Thus, even if the moped 25.110a is moving at a reasonable speed, its operator still may be unable to stop prior to entering the intersection due to the presence of the puddle 25.143. As another example, the AEFS 25.100 may determine that evasive action by the user 25.104 and/or the moped 25.110a may cause injury to the child 25.141. As a further example, the AEFS 25.100 may determine that it may be difficult for the user 25.104 to see the moped 25.110a and/or the child 25.141 due to the position of the sun 25.142. Such information may be incorporated into any models, predictions, or determinations made or maintained by the AEFS 25.100.
The threat analysis engine 26.210 includes an audio processor 26.212, an image processor 26.214, other sensor data processors 26.216, and an object tracker 26.218. In the illustrated example, the audio processor 26.212 processes audio data received from the wearable device 25.120. As noted, such data may be received from other sources as well or instead, including directly from a vehicle-mounted microphone, or the like. The audio processor 26.212 may perform various types of signal processing, including audio level analysis, frequency analysis, acoustic source localization, or the like. Based on such signal processing, the audio processor 26.212 may determine strength, direction of audio signals, audio source distance, audio source type, or the like. Outputs of the audio processor 26.212 (e.g., that an object is approaching from a particular angle) may be provided to the object tracker 26.218 and/or stored in the data store 26.240.
The image processor 26.214 receives and processes image data that may be received from sources such as the wearable device 25.120 and/or information sources 25.130. For example, the image processor 26.214 may receive image data from a camera of the wearable device 25.120, and perform object recognition to determine the type and/or position of a vehicle that is approaching the user 25.104. As another example, the image processor 26.214 may receive a video signal (e.g., a sequence or stream of images) and process them to determine the type, position, and/or velocity of a vehicle that is approaching the user 25.104. Multiple images may be processed to determine the presence or absence of motion, even if no object recognition is performed. Outputs of the image processor 26.214 (e.g., position and velocity information, vehicle type information) may be provided to the object tracker 26.218 and/or stored in the data store 26.240.
The other sensor data processor 26.216 receives and processes data received from other sensors or sources. For example, the other sensor data processor 26.216 may receive and/or determine information about the position and/or movements of the user and/or one or more vehicles, such as based on GPS systems, speedometers, accelerometers, or other devices. As another example, the other sensor data processor 26.216 may receive and process conditions information (e.g., temperature, precipitation) from the information sources 25.130 and determine that road conditions are currently icy. Outputs of the other sensor data processor 26.216 (e.g., that the user is moving at 5 miles per hour) may be provided to the object tracker 26.218 and/or stored in the data store 26.240.
The object tracker 26.218 manages a geospatial object model that includes information about objects known to the AEFS 25.100. The object tracker 26.218 receives and merges information about object types, positions, velocity, acceleration, direction of travel, and the like, from one or more of the processors 26.212, 26.214, 26.216, and/or other sources. Based on such information, the object tracker 26.218 may identify the presence of objects as well as their likely positions, paths, and the like. The object tracker 26.218 may continually update this model as new information becomes available and/or as time passes (e.g., by plotting a likely current position of an object based on its last measured position and trajectory). The object tracker 26.218 may also maintain confidence levels corresponding to elements of the geo-spatial model, such as a likelihood that a vehicle is at a particular position or moving at a particular velocity, that a particular object is a vehicle and not a pedestrian, or the like.
The agent logic 26.220 implements the core intelligence of the AEFS 25.100. The agent logic 26.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to determine vehicular threat information. For example, the agent logic 26.220 may combine information from the object tracker 26.218, such as that there is a determined likelihood of a collision at an intersection, with information from one of the information sources 25.130, such as that the intersection is the scene of common red-light violations, and decide that the likelihood of a collision is high enough to transmit a warning to the user 25.104. As another example, the agent logic 26.220 may, in the face of multiple distinct threats to the user, determine which threat is the most significant and cause the user to avoid the more significant threat, such as by not directing the user 25.104 to slam on the brakes when a bicycle is approaching from the side but a truck is approaching from the rear, because being rear-ended by the truck would have more serious consequences than being hit from the side by the bicycle.
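A minimal sketch of this kind of prioritization is shown below, ranking active threats by an expected-severity score (collision likelihood times a severity weight); the severity weights are illustrative assumptions, not values prescribed by the AEFS 25.100.

```python
# Minimal sketch: when several threats are active, rank them by an expected-
# severity score (collision likelihood times a severity weight) and warn about
# the highest-ranked one. The severity weights are illustrative assumptions.
SEVERITY = {"truck": 10, "car": 6, "moped": 3, "bicycle": 1}

def most_significant_threat(threats):
    """threats: list of dicts like {"kind": "truck", "p_collision": 0.4, ...}"""
    return max(threats,
               key=lambda t: t["p_collision"] * SEVERITY.get(t["kind"], 5))

threats = [{"kind": "bicycle", "p_collision": 0.6, "bearing": "left"},
           {"kind": "truck", "p_collision": 0.3, "bearing": "rear"}]
print(most_significant_threat(threats))   # the truck, despite the lower likelihood
```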
The presentation engine 26.230 includes a visible output processor 26.232 and an audible output processor 26.234. The visible output processor 26.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the wearable device 25.120 or some other display (e.g., a heads-up display of a vehicle 25.110 being driven by the user 25.104). The agent logic 26.220 may use or invoke the visible output processor 26.232 to prepare and display information, such as by formatting or otherwise modifying vehicular threat information to fit on a particular type or size of display. The audible output processor 26.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 26.220 may use or invoke the audible output processor 26.234 in order to convert a textual message (e.g., a warning message, a threat identification) into audio output suitable for presentation via the wearable device 25.120, for example by employing a text-to-speech processor.
Note that one or more of the illustrated components/modules may not be present in some embodiments. For example, in embodiments that do not perform image or video processing, the AEFS 25.100 may not include an image processor 26.214. As another example, in embodiments that do not perform audio output, the AEFS 25.100 may not include an audible output processor 26.234.
Note also that the AEFS 25.100 may act in service of multiple users 25.104. In some embodiments, the AEFS 25.100 may determine vehicular threat information concurrently for multiple distinct users. Such embodiments may further facilitate the sharing of vehicular threat information. For example, vehicular threat information determined as between two vehicles may be relevant and thus shared with a third vehicle that is in proximity to the other two vehicles.
FIGS. 27.1-27.112 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 27.101, the process performs receiving image data, at least some of which represents an image of a first vehicle. The process may receive and consider image data, such as by performing image processing to identify vehicles or other hazards, to determine whether collisions may occur, determine motion-related information about the first vehicle (and possibly other entities), and the like. The image data may be obtained from various sources, including from a camera attached to the wearable device or a vehicle, a road-side camera, or the like.
At block 27.102, the process performs determining vehicular threat information based at least in part on the image data. Vehicular threat information may include information related to threats posed by the first vehicle (e.g., to the user or to some other entity), by a vehicle occupied by the user (e.g., to the first vehicle or to some other entity), or the like. Note that vehicular threats may be posed by vehicles to non-vehicles, including pedestrians, animals, structures, or the like. Vehicular threats may also include those threats posed by non-vehicles (e.g., structures, pedestrians) to vehicles. Vehicular threat information may be determined in various ways, including by analyzing image data to identify objects, such as vehicles, pedestrians, fixed objects, and the like. In some embodiments, determining the vehicular threat information may also or instead include determining motion-related information about identified objects, including position, velocity, direction of travel, accelerations, or the like. Determining the vehicular threat information may also or instead include predicting whether the path of the user and one or more identified objects may intersect.
At block 27.103, the process performs presenting the vehicular threat information via a wearable device of the user. The determined threat information may be presented in various ways, such as by presenting an audible or visible warning or other indication that the first vehicle is approaching the user. Different types of wearable devices are contemplated, including helmets, eyeglasses, goggles, hats, and the like. In other embodiments, the vehicular threat information may also or instead be presented in other ways, such as via an output device on a vehicle of the user, in-situ output devices (e.g., traffic signs, road-side speakers), or the like.
At block 27.201, the process performs receiving image data from a camera of a vehicle that is occupied by the user. The user's vehicle may include one or more cameras that may capture views to the front, sides, and/or rear of the vehicle, and provide these images to the process for image processing or other analysis.
At block 27.501, the process performs receiving image data from a camera of the wearable device. For example, where the wearable device is a helmet, the helmet may include one or more helmet cameras that may capture views to the front, sides, and/or rear of the helmet.
At block 27.601, the process performs receiving image data from a camera of the first vehicle. In some embodiments, the first vehicle may itself have cameras and broadcast or otherwise transmit image data obtained via that camera.
At block 27.701, the process performs receiving image data from a camera of a vehicle that is not the first vehicle and that is not occupied by the user. In some embodiments, other vehicles in the roadway may have cameras and broadcast or otherwise transmit image data obtained via those cameras. For example, some vehicle traveling between the user and the first vehicle may transmit images of the first vehicle to be received by the process as image data.
At block 27.801, the process performs receiving image data from a road-side camera. In some embodiments, road side cameras, such as may be mounted on traffic lights, utility poles, buildings, or the like may transmit image data to the process.
At block 27.901, the process performs receiving video data that includes multiple images of the first vehicle taken at different times. In some embodiments, the image data comprises video data in compressed or raw form. The video data typically includes (or can be reconstructed or decompressed to derive) multiple sequential images taken at distinct times.
At block 27.1001, the process performs receiving a first image of the first vehicle taken at a first time.
At block 27.1002, the process performs receiving a second image of the first vehicle taken at a second time, wherein the first and second times are sufficiently different such that velocity and/or direction of travel of the first vehicle may be determined with respect to positions of the first vehicle shown in the first and second images. Various time intervals between images may be utilized. For example, it may not be necessary to receive video data having a high frame rate (e.g., 30 frames per second or higher), because it may be preferable to determine motion or other properties of the first vehicle based on images that are taken at larger time intervals (e.g., one tenth of a second, one quarter of a second). In some embodiments, transmission bandwidth may be saved by transmitting and receiving reduced frame rate image streams.
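For illustration only, a minimal Python sketch of deriving speed and direction of travel from two timestamped position fixes of the first vehicle (e.g., positions recovered from the first and second images); the planar, meters-based coordinate frame is an assumption.

```python
import math

def velocity_and_heading(p1, t1, p2, t2):
    """p1, p2: (x, y) positions in meters; t1, t2: image timestamps in seconds."""
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("the second image must be taken after the first")
    vx = (p2[0] - p1[0]) / dt
    vy = (p2[1] - p1[1]) / dt
    speed = math.hypot(vx, vy)                   # meters per second
    heading = math.degrees(math.atan2(vy, vx))   # degrees from the +x axis
    return speed, heading

# Two fixes 0.25 s apart; a reduced frame rate is often sufficient.
print(velocity_and_heading((0.0, 0.0), 10.00, (2.5, 0.5), 10.25))
```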
At block 27.1101, the process performs determining a threat posed by the first vehicle to the user. As noted, the vehicular threat information may indicate a threat posed by the first vehicle to the user, such as that the first vehicle may collide with the user unless evasive action is taken.
At block 27.1201, the process performs determining a threat posed by the first vehicle to some other entity besides the user. As noted, the vehicular threat information may indicate a threat posed by the first vehicle to some other person or thing, such as that the first vehicle may collide with the other entity. The other entity may be a vehicle occupied by the user, a vehicle not occupied by the user, a pedestrian, a structure, or any other object that may come into proximity with the first vehicle.
At block 27.1301, the process performs determining a threat posed by a vehicle occupied by the user to the first vehicle. The vehicular threat information may indicate a threat posed by the user's vehicle (e.g., as a driver or passenger) to the first vehicle, such as because a collision may occur between the two vehicles.
At block 27.1401, the process performs determining a threat posed by a vehicle occupied by the user to some other entity besides the first vehicle. The vehicular threat information may indicate a threat posed by the user's vehicle to some other person or thing, such as due to a potential collision. The other entity may be some other vehicle, a pedestrian, a structure, or any other object that may come into proximity with the user's vehicle.
At block 27.1501, the process performs identifying the first vehicle in the image data. Image processing techniques may be employed to identify the presence of a vehicle, its type (e.g., car or truck), its size, license plate number, color, or other identifying information about the first vehicle.
At block 27.1601, the process performs determining whether the first vehicle is moving towards the user based on multiple images represented by the image data. In some embodiments, a video feed or other sequence of images may be analyzed to determine the relative motion of the first vehicle. For example, if the first vehicle appears to be becoming larger over a sequence of images, then it is likely that the first vehicle is moving towards the user.
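For illustration only, a simple sketch of the growing-appearance heuristic: if the apparent size of the first vehicle increases over successive frames, it is likely closing on the camera. The bounding-box areas and the 15% growth threshold are illustrative assumptions.

```python
def appears_to_approach(box_areas_px, growth_threshold=1.15):
    """box_areas_px: bounding-box areas of the vehicle in successive frames."""
    if len(box_areas_px) < 2:
        return False
    # Growth beyond the threshold suggests the vehicle is getting closer.
    return box_areas_px[-1] >= growth_threshold * box_areas_px[0]

print(appears_to_approach([1200, 1350, 1600]))  # True: the box grew by ~33%
```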
At block 27.1701, the process performs determining motion-related information about the first vehicle, based on one or more images of the first vehicle. Motion-related information may include information about the mechanics (e.g., kinematics, dynamics) of the first vehicle, including position, velocity, direction of travel, acceleration, mass, or the like. Motion-related information may be determined for vehicles that are at rest. Motion-related information may be determined and expressed with respect to various frames of reference, including the user's frame of reference, the frame of reference of the first vehicle, a fixed frame of reference, or the like.
At block 27.1801, the process performs determining the motion-related information with respect to timestamps associated with the one or more images. In some embodiments, the received images include timestamps or other indicators that can be used to determine a time interval between the images. In other cases, the time interval may be known a priori or expressed in other ways, such as in terms of a frame rate associated with an image or video stream.
At block 27.1901, the process performs determining a position of the first vehicle. The position of the first vehicle may be expressed absolutely, such as via a GPS coordinate or similar representation, or relatively, such as with respect to the position of the user (e.g., 20 meters away from the user). In addition, the position of the first vehicle may be represented as a point or collection of points (e.g., a region, arc, or line).
At block 27.2001, the process performs determining a velocity of the first vehicle. The process may determine the velocity of the first vehicle in absolute or relative terms (e.g., with respect to the velocity of the user). The velocity may be expressed or represented as a magnitude (e.g., 10 meters per second), a vector (e.g., having a magnitude and a direction), or the like.
At block 27.2101, the process performs determining the velocity with respect to a fixed frame of reference. In some embodiments, a fixed, global, or absolute frame of reference may be utilized.
At block 27.2201, the process performs determining the velocity with respect to a frame of reference of the user. In some embodiments, velocity is expressed with respect to the user's frame of reference. In such cases, a stationary (e.g., parked) vehicle will appear to be approaching the user if the user is driving towards the first vehicle.
At block 27.2301, the process performs determining a direction of travel of the first vehicle. The process may determine a direction in which the first vehicle is traveling, such as with respect to the user and/or some absolute coordinate system or frame of reference.
At block 27.2401, the process performs determining acceleration of the first vehicle. In some embodiments, acceleration of the first vehicle may be determined, for example by determining a rate of change of the velocity of the first vehicle observed over time.
At block 27.2501, the process performs determining mass of the first vehicle. Mass of the first vehicle may be determined in various ways, including by identifying the type of the first vehicle (e.g., car, truck, motorcycle), determining the size of the first vehicle based on its appearance in an image, or the like.
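For illustration only, a coarse mass estimate keyed on the recognized vehicle type; the lookup values are rough placeholders rather than calibrated figures.

```python
TYPICAL_MASS_KG = {
    "motorcycle": 250,
    "car": 1500,
    "truck": 9000,
}

def estimate_mass_kg(vehicle_type: str) -> float:
    # Fall back to a passenger-car mass when the type is unrecognized.
    return TYPICAL_MASS_KG.get(vehicle_type, 1500)

print(estimate_mass_kg("truck"))
```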
At block 27.2601, the process performs determining that the first vehicle is driving erratically. The first vehicle may be driving erratically for a number of reasons, including due to a medical condition (e.g., a heart attack, bad eyesight, shortness of breath), drug/alcohol impairment, distractions (e.g., text messaging, crying children, loud music), or the like.
At block 27.2701, the process performs determining that the first vehicle is driving with excessive speed. Excessive speed may be determined relatively, such as with respect to the average traffic speed on a road segment, posted speed limit, or the like. For example, a vehicle may be determined to be driving with excessive speed if the vehicle is driving more than 20% over the posted speed limit. Other thresholds (e.g., 10% over, 25% over) and/or baselines (e.g., average observed speed) are contemplated.
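For illustration only, the excessive-speed test sketched in Python; the 20% margin and the choice of baseline are configurable assumptions.

```python
def is_excessive_speed(observed_mps, baseline_mps, margin=0.20):
    # baseline_mps may be a posted limit or an average observed traffic speed.
    return observed_mps > baseline_mps * (1.0 + margin)

print(is_excessive_speed(observed_mps=30.0, baseline_mps=22.0))  # True, ~36% over
```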
At block 27.2801, the process performs identifying objects other than the first vehicle in the image data. Image processing techniques may be employed by the process to identify other objects of interest, including road hazards (e.g., utility poles, ditches, drop-offs), pedestrians, other vehicles, or the like.
At block 27.2901, the process performs determining driving conditions based on the image data. Image processing techniques may be employed by the process to determine driving conditions, such as surface conditions (e.g., icy, wet), lighting conditions (e.g., glare, darkness), or the like.
At block 27.3001, the process performs determining vehicular threat information that is not related to the first vehicle. The process may determine vehicular threat information that is not due to the first vehicle, including based on a variety of other factors or information, such as driving conditions, the presence or absence of other vehicles, the presence or absence of pedestrians, or the like.
At block 27.3101, the process performs receiving and processing image data that includes images of objects and/or conditions aside from the first vehicle. At least some of the received image data may include images of things other than the first vehicle, such as other vehicles, pedestrians, driving conditions, and the like.
At block 27.3201, the process performs receiving image data of at least one of a stationary object, a pedestrian, and/or an animal. A stationary object may be a fence, guardrail, utility pole, building, parked vehicle, or the like.
At block 27.3301, the process performs processing the image data to determine the vehicular threat information that is not related to the first vehicle. For example, the process may determine that a difficult lighting condition exists due to glare or overexposure detected in the image data. As another example, the process may identify a pedestrian in the roadway depicted in the image data. As another example, the process may determine that poor road surface conditions exist.
At block 27.3401, the process performs processing data other than the image data to determine the vehicular threat information that is not related to the first vehicle. The process may analyze data other than image data, such as weather data (e.g., temperature, precipitation), time of day, traffic information, position or motion sensor information (e.g., obtained from GPS systems or accelerometers), or the like.
At block 27.3501, the process performs determining that poor driving conditions exist. Poor driving conditions may include or be based on weather information (e.g., snow, rain, ice, temperature), time information (e.g., night or day), lighting information (e.g., a light sensor indicating that the user is traveling towards the setting sun), or the like.
At block 27.3601, the process performs determining that a limited visibility condition exists. Limited visibility may be due to the time of day (e.g., at dusk, dawn, or night), weather (e.g., fog, rain), or the like.
At block 27.3701, the process performs determining that there is slow traffic in proximity to the user. The process may receive and integrate information from traffic information systems (e.g., that report accidents), other vehicles (e.g., that are reporting their speeds), or the like.
At block 27.3801, the process performs receiving information from a traffic information system regarding traffic congestion on a road traveled by the user. Traffic information systems may provide fine-grained traffic information, such as current average speeds measured on road segments in proximity to the user.
At block 27.3901, the process performs determining that one or more vehicles are traveling slower than an average or posted speed for a road traveled by the user. Slow travel may be determined based on the speed of one or more vehicles with respect to various baselines, such as average observed speed (e.g., recorded over time, based on time of day, etc.), posted speed limits, recommended speeds based on conditions, or the like.
At block 27.4001, the process performs determining that poor surface conditions exist on a roadway traveled by the user. Poor surface conditions may be due to weather (e.g., ice, snow, rain), temperature, surface type (e.g., gravel road), foreign materials (e.g., oil), or the like.
At block 27.4101, the process performs determining that there is a pedestrian in proximity to the user. The presence of pedestrians may be determined in various ways. In some embodiments, the process may utilize image processing techniques to recognize pedestrians in received image data. In other embodiments pedestrians may wear devices that transmit their location and/or presence. In other embodiments, pedestrians may be detected based on their heat signature, such as by an infrared sensor on the wearable device, user vehicle, or the like.
At block 27.4201, the process performs determining that there is an accident in proximity to the user. Accidents may be identified based on traffic information systems that report accidents, vehicle-based systems that transmit when collisions have occurred, or the like.
At block 27.4301, the process performs determining that there is an animal in proximity to the user. The presence of an animal may be determined as discussed with respect to pedestrians, above.
At block 27.4401, the process performs determining the vehicular threat information based on motion-related information that is not based on images of the first vehicle. The process may consider a variety of motion-related information received from various sources, such as the wearable device, a vehicle of the user, the first vehicle, or the like. The motion-related information may include information about the mechanics (e.g., position, velocity, acceleration, mass) of the user and/or the first vehicle.
At block 27.4501, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the user obtained from sensors in the wearable device. The wearable device may include position sensors (e.g., GPS), accelerometers, or other devices configured to provide motion-related information about the user to the process.
At block 27.4601, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the user obtained from devices in a vehicle of the user. A vehicle occupied or operated by the user may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide motion-related information about the user to the process.
At block 27.4701, the process performs determining the vehicular threat information based on information about position, velocity, and/or acceleration of the first vehicle obtained from devices of the first vehicle. The first vehicle may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide motion-related information about the first vehicle to the process. In other embodiments, motion-related information may be obtained from other sources, such as a radar gun deployed at the side of a road, from other vehicles, or the like.
At block 27.4801, the process performs determining the vehicular threat information based on gaze information associated with the user. In some embodiments, the process may consider the direction in which the user is looking when determining the vehicular threat information. For example, the vehicular threat information may depend on whether the user is or is not looking at the first vehicle, as discussed further below.
At block 27.4901, the process performs receiving an indication of a direction in which the user is looking. In some embodiments, an orientation sensor such as a gyroscope or accelerometer may be employed to determine the orientation of the user's head, face, or other body part. In some embodiments, a camera or other image sensing device may track the orientation of the user's eyes.
At block 27.4902, the process performs determining that the user is not looking towards the first vehicle. As noted, the process may track the position of the first vehicle. Given this information, coupled with information about the direction of the user's gaze, the process may determine whether or not the user is (or likely is) looking in the direction of the first vehicle.
At block 27.4903, the process performs, in response to determining that the user is not looking towards the first vehicle, directing the user to look towards the first vehicle. When it is determined that the user is not looking at the first vehicle, the process may warn or otherwise direct the user to look in that direction, such as by saying or otherwise presenting “Look right!”, “Car on your left,” or a similar message.
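For illustration only, a non-limiting sketch of the gaze check in blocks 27.4901-27.4903: the user's gaze direction is compared with the bearing toward the first vehicle, and a directive is produced when the angular difference exceeds a tolerance. Angles are assumed to be measured counterclockwise from the +x axis in a shared ground-plane frame; the 30-degree tolerance is illustrative.

```python
import math

def bearing_to(user_xy, vehicle_xy):
    dx, dy = vehicle_xy[0] - user_xy[0], vehicle_xy[1] - user_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360

def gaze_warning(user_xy, gaze_deg, vehicle_xy, tolerance_deg=30.0):
    bearing = bearing_to(user_xy, vehicle_xy)
    diff = (bearing - gaze_deg + 180) % 360 - 180   # signed smallest angle
    if abs(diff) <= tolerance_deg:
        return None                                  # user already looking that way
    # With counterclockwise-positive angles, a negative difference is to the right.
    return "Look right!" if diff < 0 else "Look left!"

print(gaze_warning(user_xy=(0, 0), gaze_deg=90.0, vehicle_xy=(10, -2)))
```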
At block 27.5001, the process performs identifying multiple threats to the user. The process may in some cases identify multiple potential threats, such as one car approaching the user from behind and another car approaching the user from the left.
At block 27.5002, the process performs identifying a first one of the multiple threats that is more significant than at least one other of the multiple threats. The process may rank, order, or otherwise evaluate the relative significance or risk presented by each of the identified threats. For example, the process may determine that a truck approaching from the right is a bigger risk than a bicycle approaching from behind. On the other hand, if the truck is moving very slowly (thus leaving more time for the truck and/or the user to avoid it) compared to the bicycle, the process may instead determine that the bicycle is the bigger risk.
At block 27.5003, the process performs instructing the user to avoid the first one of the multiple threats. Instructing the user may include outputting a command or suggestion to take (or not take) a particular course of action.
At block 27.5101, the process performs modeling multiple potential accidents that each correspond to one of the multiple threats to determine a collision force associated with each accident. In some embodiments, the process models the physics of various objects to determine potential collisions and possibly their severity and/or likelihood. For example, the process may determine an expected force of a collision based on factors such as object mass, velocity, acceleration, deceleration, or the like.
At block 27.5102, the process performs selecting the first threat based at least in part on which of the multiple accidents has the highest collision force. In some embodiments, the process considers the threat having the highest associated collision force when determining the most significant threat, because that threat will likely result in the greatest injury to the user.
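For illustration only, a non-limiting sketch of ranking threats by a crude severity proxy: the kinetic energy of the closing motion stands in for the expected collision force. A real model would also account for impact angle, deceleration distance, and similar factors; all figures here are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Threat:
    name: str
    mass_kg: float
    closing_speed_mps: float

def severity(t: Threat) -> float:
    # 0.5 * m * v^2 of the closing motion, in joules, as a rough severity proxy.
    return 0.5 * t.mass_kg * t.closing_speed_mps ** 2

threats = [
    Threat("truck from the right", 9000, 2.0),   # heavy but slow
    Threat("bicycle from behind", 90, 9.0),      # light but fast
]
worst = max(threats, key=severity)
print(f"Avoid the {worst.name} first")
```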
At block 27.5201, the process performs determining a likelihood of an accident associated with each of the multiple threats. In some embodiments, the process associates a likelihood (probability) with each of the multiple threats. Such a probability may be determined with respect to a physical model that represents uncertainty with respect to the mechanics of the various objects that it models.
At block 27.5202, the process performs selecting the first threat based at least in part on which of the multiple threats has the highest associated likelihood. The process may consider the threat having the highest associated likelihood when determining the most significant threat.
At block 27.5301, the process performs determining a mass of an object associated with each of the multiple threats. In some embodiments, the process may consider the mass of threat objects, based on the assumption that those objects having higher mass (e.g., a truck) pose greater threats than those having a low mass (e.g., a pedestrian).
At block 27.5302, the process performs selecting the first threat based at least in part on which of the objects has the highest mass.
At block 27.5401, the process performs selecting the most significant threat from the multiple threats.
At block 27.5501, the process performs determining that an evasive action with respect to the first vehicle poses a threat to some other object. The process may consider whether potential evasive actions pose threats to other objects. For example, the process may analyze whether directing the user to turn right would cause the user to collide with a pedestrian or some fixed object, which may actually result in a worse outcome (e.g., for the user and/or the pedestrian) than colliding with the first vehicle.
At block 27.5502, the process performs instructing the user to take some other evasive action that poses a lesser threat to the some other object. The process may rank or otherwise order evasive actions (e.g., slow down, turn left, turn right) based at least in part on the risks or threats those evasive actions pose to other entities.
At block 27.5601, the process performs identifying multiple threats that each have an associated likelihood and cost. In some embodiments, the process may perform a cost-minimization analysis, in which it considers multiple threats, including threats posed to the user and to others, and selects a threat that minimizes or reduces expected costs. The process may also consider threats posed by actions taken by the user to avoid other threats.
At block 27.5602, the process performs determining a course of action that minimizes an expected cost with respect to the multiple threats. Expected cost of a threat may be expressed as a product of the likelihood of damage associated with the threat and the cost associated with such damage.
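For illustration only, a non-limiting sketch of the expected-cost analysis: each candidate action is scored as the sum of likelihood times cost over the threats it leaves in play, and the lowest-scoring action is chosen. The actions, probabilities, and costs are placeholder values.

```python
ACTIONS = {
    # action: (likelihood, cost) pairs for the threats remaining under that action
    "brake hard":  [(0.10, 5_000)],                    # small chance of a rear-end collision
    "swerve left": [(0.30, 20_000), (0.05, 100_000)],  # oncoming car, roadside pedestrian
    "do nothing":  [(0.80, 50_000)],                   # likely collision with the first vehicle
}

def expected_cost(threats):
    return sum(likelihood * cost for likelihood, cost in threats)

best = min(ACTIONS, key=lambda action: expected_cost(ACTIONS[action]))
print(best, {action: expected_cost(t) for action, t in ACTIONS.items()})
```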
At block 27.5801, the process performs identifying multiple threats that are each related to different persons or things. In some embodiments, the process considers risks related to multiple distinct entities, possibly including the user.
At block 27.5901, the process performs identifying multiple threats that are each related to the user. In some embodiments, the process also or only considers risks that are related to the user.
At block 27.6001, the process performs minimizing expected costs to the user posed by the multiple threats. In some embodiments, the process attempts to minimize those costs borne by the user. Note that this may cause the process to recommend a course of action that is not optimal from a societal perspective, such as by directing the user to drive his car over a pedestrian rather than to crash into a car or structure.
At block 27.6101, the process performs minimizing overall expected costs posed by the multiple threats, the overall expected costs being a sum of expected costs borne by the user and other persons/things. In some embodiments, the process attempts to minimize social costs, that is, the costs borne by the various parties to an accident. Note that this may cause the process to recommend a course of action that may have a high cost to the user (e.g., crashing into a wall and damaging the user's car) to spare an even higher cost to another person (e.g., killing a pedestrian).
At block 27.6201, the process performs presenting the vehicular threat information via an audio output device of the wearable device. The process may play an alarm, bell, chime, voice message, or the like that warns or otherwise informs the user of the vehicular threat information. The wearable device may include audio speakers operable to output audio signals, including as part of a set of earphones, earbuds, a headset, a helmet, or the like.
At block 27.6301, the process performs presenting the vehicular threat information via a visual display device of the wearable device. In some embodiments, the wearable device includes a display screen or other mechanism for presenting visual information. For example, when the wearable device is a helmet, a face shield of the helmet may be used as a type of heads-up display for presenting the vehicular threat information.
At block 27.6401, the process performs displaying an indicator that instructs the user to look towards the first vehicle. The displayed indicator may be textual (e.g., “Look right!”), iconic (e.g., an arrow), or the like.
At block 27.6501, the process performs displaying an indicator that instructs the user to accelerate, decelerate, and/or turn. An example indicator may be or include the text “Speed up,” “slow down,” “turn left,” or similar language.
At block 27.6601, the process performs directing the user to accelerate.
At block 27.6701, the process performs directing the user to decelerate.
At block 27.6801, the process performs directing the user to turn.
At block 27.6901, the process performs transmitting to the first vehicle a warning based on the vehicular threat information. The process may send or otherwise transmit a warning or other message to the first vehicle that instructs the operator of the first vehicle to take evasive action. The instruction to the first vehicle may be complementary to any instructions given to the user, such that if both instructions are followed, the risk of collision decreases. In this manner, the process may help avoid a situation in which the user and the operator of the first vehicle take actions that actually increase the risk of collision, such as may occur when the user and the first vehicle are approaching head-on but do not turn away from one another.
At block 27.7001, the process performs presenting the vehicular threat information via an output device of a vehicle of the user, the output device including a visual display and/or an audio speaker. In some embodiments, the process may use other devices to output the vehicular threat information, such as output devices of a vehicle of the user, including a car stereo, dashboard display, or the like.
At block 27.7401, the process performs presenting the vehicular threat information via goggles worn by the user. The goggles may include a small display, an audio speaker, a haptic output device, or the like.
At block 27.7501, the process performs presenting the vehicular threat information via a helmet worn by the user. The helmet may include an audio speaker or visual output device, such as a display that presents information on the inside of the face screen of the helmet. Other output devices, including haptic devices, are contemplated.
At block 27.7601, the process performs presenting the vehicular threat information via a hat worn by the user. The hat may include an audio speaker or similar output device.
At block 27.7701, the process performs presenting the vehicular threat information via eyeglasses worn by the user. The eyeglasses may include a small display, an audio speaker, a haptic output device, or the like.
At block 27.7801, the process performs presenting the vehicular threat information via audio speakers that are part of at least one of earphones, a headset, earbuds, and/or a hearing aid. The audio speakers may be integrated into the wearable device. In other embodiments, other audio speakers (e.g., of a car stereo) may be employed instead or in addition.
At block 27.7901, the process performs performing the receiving image data, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing device in the wearable device of the user. In some embodiments, a computing device of or in the wearable device may be responsible for performing one or more of the operations of the process. For example, a computing device situated within a helmet worn by the user may receive and analyze image data to determine and present the vehicular threat information to the user.
At block 27.8001, the process performs performing the receiving image data, the determining vehicular threat information, and/or the presenting the vehicular threat information on a road-side computing system. In some embodiments, an in-situ computing system may be responsible for performing one or more of the operations of the process. For example, a computing system situated at or about a street intersection may receive and analyze image data of vehicles that are entering or nearing the intersection. Such an architecture may be beneficial when the wearable device is a “thin” device that does not have sufficient processing power to, for example, determine whether the first vehicle is approaching the user.
At block 27.8002, the process performs transmitting the vehicular threat information from the road-side computing system to the wearable device of the user. For example, when the road-side computing system determines that two vehicles may be on a collision course, the computing system can transmit vehicular threat information to the wearable device so that the user can take evasive action and avoid a possible accident.
At block 27.8101, the process performs performing the receiving image data, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing system in the first vehicle. In some embodiments, a computing system in the first vehicle performs one or more of the operations of the process. Such an architecture may be beneficial when the wearable device is a “thin” device that does not have sufficient processing power to, for example, determine whether the first vehicle is approaching the user.
At block 27.8102, the process performs transmitting the vehicular threat information from the computing system to the wearable device of the user.
At block 27.8201, the process performs performing the receiving image data, the determining vehicular threat information, and/or the presenting the vehicular threat information on a computing system in a second vehicle, wherein the user is not traveling in the second vehicle. In some embodiments, other vehicles that are not carrying the user and are not the same as the first vehicle may perform one or more of the operations of the process. In general, computing systems/devices situated in or at multiple vehicles, wearable devices, or fixed stations in a roadway may each perform operations related to determining vehicular threat information, which may then be shared with other users and devices to improve traffic flow, avoid collisions, and generally enhance the abilities of users of the roadway.
At block 27.8202, the process performs transmitting the vehicular threat information from the computing system to the wearable device of the user.
At block 27.8301, the process performs receiving data representing an audio signal emitted by the first vehicle. The data representing the audio signal may be raw audio samples, compressed audio data, frequency coefficients, or the like. The data representing the audio signal may represent the sound made by the first vehicle, such as from its engine, a horn, tires, or any other source of sound. The data representing the audio signal may include sounds from other sources, including other vehicles, pedestrians, or the like. The audio signal may be obtained at or about a user who is a pedestrian or who is in a vehicle that is not the first vehicle, either as the operator or a passenger.
At block 27.8302, the process performs determining the vehicular threat information based further on the data representing the audio signal. As discussed further below, determining the vehicular threat information based on audio may include acoustic source localization, frequency analysis, or other techniques that can identify the presence, position, or motion of objects.
At block 27.8401, the process performs receiving data obtained at a microphone array that includes multiple microphones. In some embodiments, a microphone array having two or more microphones is employed to receive audio signals. Differences between the received audio signals may be utilized to perform acoustic source localization or other functions, as discussed further herein.
At block 27.8501, the process performs receiving data obtained at a microphone array, the microphone array coupled to a vehicle of the user. In some embodiments, such as when the user is operating or otherwise traveling in a vehicle of his own (that is not the same as the first vehicle), the microphone array may be coupled or attached to the user's vehicle, such as by having a microphone located at each of the four corners of the user's vehicle.
At block 27.8601, the process performs receiving data obtained at a microphone array, the microphone array coupled to the wearable device. For example, if the wearable device is a helmet, then a first microphone may be located on the left side of the helmet while a second microphone may be located on the right side of the helmet.
At block 27.8701, the process performs performing acoustic source localization to determine a position of the first vehicle based on multiple audio signals received via multiple microphones. The process may determine a position of the first vehicle by analyzing audio signals received via multiple distinct microphones. For example, engine noise of the first vehicle may have different characteristics (e.g., in volume, in time of arrival, in frequency) as received by different microphones. Differences between the audio signal measured at different microphones may be exploited to determine one or more positions (e.g., points, arcs, lines, regions) at which the first vehicle may be located.
At block 27.8801, the process performs receiving an audio signal via a first one of the multiple microphones, the audio signal representing a sound created by the first vehicle. In one approach, at least two microphones are employed. By measuring differences in the arrival time of an audio signal at the two microphones, the position of the first vehicle may be determined. The determined position may be a point, a line, an area, or the like.
At block 27.8802, the process performs receiving the audio signal via a second one of the multiple microphones.
At block 27.8803, the process performs determining the position of the first vehicle by determining a difference between an arrival time of the audio signal at the first microphone and an arrival time of the audio signal at the second microphone. In some embodiments, given information about the distance between the two microphones and the speed of sound, the process may determine the respective distances between each of the two microphones and the first vehicle. Given these two distances (along with the distance between the microphones), the process can solve for the one or more positions at which the first vehicle may be located.
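For illustration only, a non-limiting sketch of a two-microphone arrival-time analysis under a far-field (plane-wave) simplification, which yields a bearing rather than a full position: sin(theta) = c * dt / d, where c is the speed of sound, dt the arrival-time difference, and d the microphone spacing. The 343 m/s speed of sound and the sign convention are assumptions.

```python
import math

SPEED_OF_SOUND_MPS = 343.0

def angle_of_arrival_deg(dt_s, mic_spacing_m):
    """dt_s: arrival time at mic 2 minus arrival time at mic 1, in seconds.

    A positive result means the source lies toward mic 1's side of the array.
    """
    s = (SPEED_OF_SOUND_MPS * dt_s) / mic_spacing_m
    s = max(-1.0, min(1.0, s))          # clamp against measurement noise
    return math.degrees(math.asin(s))   # 0 deg = broadside to the array

# Mic 1 hears the vehicle 0.4 ms before mic 2; microphones are 0.3 m apart.
print(angle_of_arrival_deg(0.0004, 0.3))  # roughly 27 degrees off broadside
```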
At block 27.8901, the process performs triangulating the position of the first vehicle based on a first and second angle, the first angle measured between a first one of the multiple microphones and the first vehicle, the second angle measured between a second one of the multiple microphones and the first vehicle. In some embodiments, the microphones may be directional, in that they may be used to determine the direction from which the sound is coming. Given such information, the process may use triangulation techniques to determine the position of the first vehicle.
At block 27.9001, the process performs performing a Doppler analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the frequency of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. For example, if the frequency is shifting higher, the first vehicle may be determined to be approaching the user. Note that the determination is typically made from the frame of reference of the user (who may be moving or not). Thus, the first vehicle may be determined to be approaching the user when, as viewed from a fixed frame of reference, the user is approaching the first vehicle (e.g., a moving user traveling towards a stationary vehicle) or the first vehicle is approaching the user (e.g., a moving vehicle approaching a stationary user). In other embodiments, other frames of reference may be employed, such as a fixed frame, a frame associated with the first vehicle, or the like.
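For illustration only, a non-limiting sketch of the Doppler relation for a source approaching a stationary observer, f_obs = f_rest * c / (c - v), rearranged to estimate closing speed; it assumes the rest frequency of the tracked tone (e.g., a stable engine-noise band) is known or estimated, and uses 343 m/s for the speed of sound.

```python
SPEED_OF_SOUND_MPS = 343.0

def closing_speed_mps(observed_hz, rest_hz):
    # Positive when the source is approaching the observer (user's frame).
    return SPEED_OF_SOUND_MPS * (1.0 - rest_hz / observed_hz)

def is_approaching(observed_hz, rest_hz, tolerance_mps=0.5):
    return closing_speed_mps(observed_hz, rest_hz) > tolerance_mps

# An engine tone at a 200 Hz rest frequency observed at 215 Hz: ~24 m/s closing.
print(closing_speed_mps(215.0, 200.0), is_approaching(215.0, 200.0))
```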
At block 27.9101, the process performs determining whether frequency of the audio signal is increasing or decreasing.
At block 27.9201, the process performs performing a volume analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the volume (e.g., amplitude) of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. As noted, different embodiments may use different frames of reference when making this determination.
At block 27.9301, the process performs determining whether volume of the audio signal is increasing or decreasing.
At block 27.9401, the process performs receiving data representing the first vehicle obtained at a road-based device. In some embodiments, the process may also consider data received from devices that are located in or about the roadway traveled by the user. Such devices may include cameras, loop coils, motion sensors, and the like.
At block 27.9402, the process performs determining the vehicular threat information based further on the data representing the first vehicle. For example, the process may determine that a car is approaching the user by analyzing an image taken from a camera that is mounted on or near a traffic signal over an intersection. As another example, the process may determine the speed of a vehicle with reference to data obtained from a radar gun/detector.
At block 27.9501, the process performs receiving the data from a sensor deployed at an intersection. Various types of sensors are contemplated, including cameras, range sensors (e.g., sonar, radar, LIDAR, IR-based), magnetic coils, audio sensors, or the like.
At block 27.9601, the process performs receiving an image of the first vehicle from a camera deployed at an intersection. For example, the process may receive images from a camera that is fixed to a traffic light or other signal at an intersection.
At block 27.9701, the process performs receiving ranging data from a range sensor deployed at an intersection, the ranging data representing a distance between the first vehicle and the intersection. For example, the process may receive a distance (e.g., 75 meters) measured between some known point in the intersection (e.g., the position of the range sensor) and an oncoming vehicle.
At block 27.9801, the process performs receiving data from an induction loop deployed in a road surface, the induction loop configured to detect the presence and/or velocity of the first vehicle. Induction loops may be embedded in the roadway and configured to detect the presence of vehicles passing over them. Some types of loops and/or processing may be employed to detect other information, including velocity, vehicle size, and the like.
At block 27.9901, the process performs identifying the first vehicle in an image obtained from the road-based sensor. Image processing techniques may be employed to identify the presence of a vehicle, its type (e.g., car or truck), its size, or other information.
At block 27.10001, the process performs determining a trajectory of the first vehicle based on multiple images obtained from the road-based device. In some embodiments, a video feed or other sequence of images may be analyzed to determine the position, speed, and/or direction of travel of the first vehicle.
At block 27.10101, the process performs receiving data representing vehicular threat information relevant to a second vehicle, the second vehicle not being used for travel by the user. As noted, vehicular threat information may in some embodiments be shared amongst vehicles and entities present in a roadway. For example, a vehicle that is traveling just ahead of the user may determine that it is threatened by the first vehicle. This information may be shared with the user so that the user can also take evasive action, such as by slowing down or changing course.
At block 27.10102, the process performs determining the vehicular threat information based on the data representing vehicular threat information relevant to the second vehicle. Having received vehicular threat information from the second vehicle, the process may determine that it is also relevant to the user, and then accordingly present it to the user.
At block 27.10201, the process performs receiving from the second vehicle an indication of stalled or slow traffic encountered by the second vehicle. Various types of threat information relevant to the second vehicle may be provided to the process, such as that there is stalled or slow traffic ahead of the second vehicle.
At block 27.10301, the process performs receiving from the second vehicle an indication of poor driving conditions experienced by the second vehicle. The second vehicle may share the fact that it is experiencing poor driving conditions, such as an icy or wet roadway.
At block 27.10401, the process performs receiving from the second vehicle an indication that the first vehicle is driving erratically. The second vehicle may share a determination that the first vehicle is driving erratically, such as by swerving, driving with excessive speed, driving too slowly, or the like.
At block 27.10501, the process performs receiving from the second vehicle an image of the first vehicle. The second vehicle may include one or more cameras, and may share images obtained via those cameras with other entities.
At block 27.10601, the process performs transmitting the vehicular threat information to a second vehicle. As noted, vehicular threat information may in some embodiments be shared amongst vehicles and entities present in a roadway. In this example, the vehicular threat information is transmitted to a second vehicle (e.g., one following behind the user), so that the second vehicle may benefit from the determined vehicular threat information as well.
At block 27.10701, the process performs transmitting the vehicular threat information to an intermediary server system for distribution to other vehicles in proximity to the user. In some embodiments, intermediary systems may operate as relays for sharing the vehicular threat information with other vehicles and users of a roadway.
At block 27.10801, the process performs transmitting the vehicular threat information to a law enforcement entity. In some embodiments, the process shares the vehicular threat information with law enforcement entities, including computer or other information systems managed or operated by such entities. For example, if the process determines that the first vehicle is driving erratically, the process may transmit that determination and/or information about the first vehicle to the police.
At block 27.10901, the process performs determining a license plate identifier of the first vehicle based on the image data. The process may perform image processing (e.g., optical character recognition) to determine the license number on the license plate of the first vehicle.
At block 27.10902, the process performs transmitting the license plate identifier to the law enforcement entity.
At block 27.11001, the process performs determining a vehicle description of the first vehicle based on the image data. Image processing may be utilized to determine a vehicle description, including one or more of type, make, year, and/or color of the first vehicle.
At block 27.11002, the process performs transmitting the vehicle description to the law enforcement entity.
At block 27.11101, the process performs determining a location associated with the first vehicle. The process may reference a GPS system to determine the current location of the user and/or the first vehicle, and then provide an indication of that location to the police or other agency. The location may be or include a coordinate, a street or intersection name, a name of a municipality, or the like.
At block 27.11102, the process performs transmitting an indication of the location to the law enforcement entity.
At block 27.11201, the process performs determining a direction of travel of the first vehicle. As discussed above, the process may determine direction of travel in various ways, such as by modeling the motion of the first vehicle. Such a direction may then be provided to the police or other agency, such as by reporting that the first vehicle is traveling northbound.
At block 27.11202, the process performs transmitting an indication of the direction of travel to the law enforcement entity.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 25.100. In addition, the computing system 28.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 25.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 28.400 comprises a computer memory (“memory”) 28.401, a display 28.402, one or more Central Processing Units (“CPU”) 28.403, Input/Output devices 28.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 28.405, and network connections 28.406. The AEFS 25.100 is shown residing in memory 28.401. In other embodiments, some portion of the contents and some or all of the components of the AEFS 25.100 may be stored on and/or transmitted over the other computer-readable media 28.405. The components of the AEFS 25.100 preferably execute on one or more CPUs 28.403 and implement techniques described herein. Other code or programs 28.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 28.420, also reside in the memory 28.401, and preferably execute on one or more CPUs 28.403. Of note, one or more of the components in the computing system 28.400 may not be present in any specific implementation.
The AEFS 25.100 interacts via the network 28.450 with wearable devices 25.120, information sources 25.130, and third-party systems/applications 28.455. The network 28.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 28.455 may include any systems that provide data to, or utilize data from, the AEFS 25.100, including Web browsers, vehicle-based client systems, traffic tracking, monitoring, or prediction systems, and the like.
The AEFS 25.100 is shown executing in the memory 28.401 of the computing system 28.400. Also included in the memory are a user interface manager 28.415 and an application program interface (“API”) 28.416. The user interface manager 28.415 and the API 28.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 25.100.
The UI manager 28.415 provides a view and a controller that facilitate user interaction with the AEFS 25.100 and its various components. For example, the UI manager 28.415 may provide interactive access to the AEFS 25.100, such that users can configure the operation of the AEFS 25.100, such as by providing the AEFS 25.100 with information about common routes traveled, vehicle types used, driving patterns, or the like. The UI manager 28.415 may also manage and/or implement various output abstractions, such that the AEFS 25.100 can cause vehicular threat information to be displayed on different media, devices, or systems. In some embodiments, access to the functionality of the UI manager 28.415 may be provided via a Web server, possibly executing as one of the other programs 28.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 28.455 can interact with the AEFS 25.100 via the UI manager 28.415.
The API 28.416 provides programmatic access to one or more functions of the AEFS 25.100. For example, the API 28.416 may provide a programmatic interface to one or more functions of the AEFS 25.100 that may be invoked by one of the other programs 28.430 or some other module. In this manner, the API 28.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 25.100 into vehicle-based client systems or devices), and the like.
In addition, the API 28.416 may, in at least some embodiments, be invoked or otherwise accessed by remote entities, such as code executing on one of the wearable devices 25.120, information sources 25.130, and/or one of the third-party systems/applications 28.455, to access various functions of the AEFS 25.100. For example, an information source 25.130 such as a radar gun installed at an intersection may push motion-related information (e.g., velocity) about vehicles to the AEFS 25.100 via the API 28.416. As another example, a weather information system may push current conditions information (e.g., temperature, precipitation) to the AEFS 25.100 via the API 28.416. The API 28.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 28.455 and that are configured to interact with the AEFS 25.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
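For illustration only, a non-limiting sketch of an information source (e.g., a road-side radar unit) pushing a motion reading to the AEFS over HTTP; the endpoint URL, payload fields, and authentication-free request are assumptions for illustration and do not describe a documented interface.

```python
import json
import urllib.request

def push_speed_reading(vehicle_speed_mps: float, location: str) -> None:
    payload = json.dumps({
        "source": "roadside-radar-17",      # hypothetical sensor identifier
        "location": location,
        "speed_mps": vehicle_speed_mps,
    }).encode("utf-8")
    request = urllib.request.Request(
        "https://aefs.example.com/api/motion",   # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(response.status)

# push_speed_reading(31.5, "5th Ave & Main St")  # not run: the endpoint is illustrative
```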
In an example embodiment, components/modules of the AEFS 25.100 are implemented using standard programming techniques. For example, the AEFS 25.100 may be implemented as a “native” executable running on the CPU 28.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 25.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 28.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 25.100, such as in the data store 28.420 (or 26.240), can be available by standard mechanisms such as through C, C++, C#, and Java APIs; through libraries for accessing files, databases, or other data repositories; through data interchange formats such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 28.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 25.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, for enhancing a user's ability to operate or function in a transportation-related context (e.g., as a pedestrian or vehicle operator) by performing threat detection based at least in part on analyzing information received from road-based devices, such as a camera, microphone, or other sensor deployed at the side of a road, at an intersection, or other road-based location. The received information may include image data, audio data, or other data/signals that represent vehicles and other objects or conditions present in a roadway or other context. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”) that performs at least some of the described techniques. Embodiments of the AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities (e.g., driving, riding a bike, walking/running) of a user.
In some embodiments, the AEFS is configured to identify threats (e.g., posed by vehicles to a user of a roadway, posed by a user to vehicles or other users of a roadway), and to provide information about such threats to the user so that he may take evasive action. Identifying threats may include analyzing information about a vehicle that is present in the roadway in order to determine whether the user and the vehicle may be on a collision course. The analyzed information may include or be represented by image data (e.g., pictures or video of a roadway and its surrounding environment), audio data (e.g., sounds reflected from or emitted by a vehicle), range information (e.g., provided by a sonar or infrared range sensor), conditions information (e.g., weather, temperature, time of day), or the like. The user may be a pedestrian (e.g., a walker, a jogger), an operator of a motorized (e.g., car, motorcycle, moped, scooter) or non-motorized vehicle (e.g., bicycle, pedicab, rickshaw), a vehicle passenger, or the like. In some embodiments, the vehicle may be operating autonomously. In some embodiments, the user wears a wearable device (e.g., a helmet, goggles, eyeglasses, hat) that is configured to at least present determined vehicular threat information to the user.
The AEFS may determine threats based on information received from various sources. Road-based sources may provide image, audio, or other types of data to the AEFS. The road-based sources may include sensors, devices, or systems that are deployed at, within, or about a roadway or intersection. For example, cameras, microphones, range sensors, velocity sensors, and the like may be affixed to utility or traffic signal support structures (e.g., poles, posts). As another example, induction coils embedded within a road can provide information to the AEFS about the presence and/or velocity of vehicles traveling over the road.
In some embodiments, the AEFS is configured to receive image data, at least some of which represents an image of a first vehicle. The image data may be obtained from various sources, including a camera of a wearable device of a user, a camera on a vehicle of the user, a road-side camera, a camera on some other vehicle, or the like. The image data may represent electromagnetic signals of various types or in various ranges, including visual signals (e.g., signals having a wavelength in the range of about 390-750 nm), infrared signals (e.g., signals having a wavelength in the range of about 750 nm-300 micrometers), or the like.
Then, the AEFS determines vehicular threat information based at least in part on the image data. In some embodiments, the AEFS may analyze the received image data in order to identify the first vehicle and/or to determine whether the first vehicle represents a threat to the user, such as because the first vehicle and the user may be on a collision course. The image data may be analyzed in various ways, including by identifying objects (e.g., to recognize that a vehicle or some other object is shown in the image data), determining motion-related information (e.g., position, velocity, acceleration, mass) about objects, or the like.
Next, the AEFS informs the user of the determined vehicular threat information via a wearable device of the user. Typically, the user's wearable device (e.g., a helmet) will include one or more output devices, such as audio speakers, visual display devices (e.g., warning lights, screens, heads-up displays), haptic devices, and the like. The AEFS may present the vehicular threat information via one or more of these output devices. For example, the AEFS may visually display or speak the words “Car on left.” As another example, the AEFS may visually display a leftward pointing arrow on a heads-up screen displayed on a face screen of the user's helmet. Presenting the vehicular threat information may also or instead include presenting a recommended course of action (e.g., to slow down, to speed up, to turn) to mitigate the determined vehicular threat.
The AEFS may use other or additional sources or types of information. For example, in some embodiments, the AEFS is configured to receive data representing an audio signal emitted by a first vehicle. The audio signal is typically obtained in proximity to a user, who may be a pedestrian or traveling in a vehicle as an operator or a passenger. In some embodiments, the audio signal is obtained by one or more microphones coupled to a road-side structure, the user's vehicle and/or a wearable device of the user, such as a helmet, goggles, a hat, a media player, or the like. Then, the AEFS may determine vehicular threat information based at least in part on the data representing the audio signal. In some embodiments, the AEFS may analyze the received data in order to determine whether the first vehicle and the user are on a collision course. The audio data may be analyzed in various ways, including by performing audio analysis, frequency analysis (e.g., Doppler analysis), acoustic localization, or the like.
The AEFS may combine information of various types in order to determine threat information. For example, because image processing may be computationally expensive, rather than always processing all image data obtained from every possible source, the AEFS may use audio analysis to initially determine the approximate location of an oncoming vehicle, such as to the user's left, right, or rear. For example, having determined based on audio data that a vehicle may be approaching from the rear of the user, the AEFS may preferentially process image data from a rear-facing camera to further refine a threat analysis. As another example, the AEFS may incorporate information about the condition of a roadway (e.g., icy or wet) when determining whether a vehicle will be able to stop or maneuver in order to avoid an accident.
In this example, the moped 29.110a is driving towards the motorcycle 29.110b from a side street, at approximately a right angle with respect to the path of travel of the motorcycle 29.110b. The traffic signal 29.106 has just turned from red to green for the motorcycle 29.110b, and the user 29.104 is beginning to drive the motorcycle 29.110b into the intersection controlled by the traffic signal 29.106. The user 29.104 assumes that the moped 29.110a will stop, because cross traffic will have a red light. However, in this example, the moped 29.110a may not stop in a timely manner, for one or more reasons, such as because the operator of the moped 29.110a has not seen the red light, because the moped 29.110a is moving at an excessive rate, because the operator of the moped 29.110a is impaired, because the surface conditions of the roadway are icy or slick, or the like. As will be discussed further below, the AEFS 29.100 will determine that the moped 29.110a and the motorcycle 29.110b are likely on a collision course, and inform the user 29.104 of this threat via the helmet 29.120a, so that the user may take evasive action to avoid a possible collision with the moped 29.110a.
The moped 29.110a emits or reflects a signal 29.101. In some embodiments, the signal 29.101 is an electromagnetic signal in the visible light spectrum that represents an image of the moped 29.110a. Other types of electromagnetic signals may be received and processed, including infrared radiation, radio waves, microwaves, or the like. Other types of signals are contemplated, including audio signals, such as an emitted engine noise, a reflected sonar signal, a vocalization (e.g., shout, scream), etc. The signal 29.101 may be received by a receiving detector/device/sensor, such as a camera or microphone (not shown) on the helmet 29.120a and/or the motorcycle 29.110b. In some embodiments, a computing and communication device within the helmet 29.120a receives and samples the signal 29.101 and transmits the samples or other representation to the AEFS 29.100. In other embodiments, other forms of data may be used to represent the signal 29.101, including frequency coefficients, compressed audio/video, or the like.
The AEFS 29.100 determines vehicular threat information by analyzing the received data that represents the signal 29.101. If the signal 29.101 is a visual signal, then the AEFS 29.100 may employ various image data processing techniques. For example, the AEFS 29.100 may perform object recognition to determine that received image data includes an image of a vehicle, such as the moped 29.110a. The AEFS 29.100 may also or instead process received image data to determine motion-related information with respect to the moped 29.110a, including position, velocity, acceleration, or the like. The AEFS 29.100 may further identify the presence of other objects, including pedestrians, animals, structures, or the like, that may pose a threat to the user 29.104 or that may themselves be threatened (e.g., by actions of the user 29.104 and/or the moped 29.110a). Image processing may also be employed to determine other information, including road conditions (e.g., wet or icy roads), visibility conditions (e.g., glare or darkness), and the like.
If the signal 29.101 is an audio signal, then the AEFS 29.100 may use one or more audio analysis techniques to determine the vehicular threat information. In one embodiment, the AEFS 29.100 performs a Doppler analysis (e.g., by determining whether the frequency of the audio signal is increasing or decreasing) to determine that the object that is emitting the audio signal is approaching (and possibly at what rate) the user 29.104. In some embodiments, the AEFS 29.100 may determine the type of vehicle (e.g., a heavy truck, a passenger vehicle, a motorcycle, a moped) by analyzing the received data to identify an audio signature that is correlated with a particular engine type or size. For example, a lower frequency engine sound may be correlated with a larger vehicle size, and a higher frequency engine sound may be correlated with a smaller vehicle size.
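For illustration only, the following Python sketch shows one way such a Doppler-style analysis might be approximated, assuming that mono audio samples are available as NumPy arrays; the function names, window handling, and tolerance value are illustrative assumptions rather than a description of any particular embodiment.

```python
# A minimal sketch of Doppler-based approach detection, assuming mono audio
# samples are available as NumPy arrays.  Names and thresholds are illustrative.
import numpy as np

def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) with the largest magnitude in the window."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

def is_approaching(early: np.ndarray, late: np.ndarray, sample_rate: int,
                   tolerance_hz: float = 2.0) -> bool:
    """Compare the dominant engine tone in an earlier and a later window;
    a rising pitch suggests the source is closing on the listener."""
    return (dominant_frequency(late, sample_rate)
            - dominant_frequency(early, sample_rate)) > tolerance_hz
```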
In one embodiment, where the signal 29.101 is an audio signal, the AEFS 29.100 performs acoustic source localization to determine information about the trajectory of the moped 29.110a, including one or more of position, direction of travel, speed, acceleration, or the like. Acoustic source localization may include receiving data representing the audio signal 29.101 as measured by two or more microphones. For example, the helmet 29.120a may include four microphones (e.g., front, right, rear, and left) that each receive the audio signal 29.101. These microphones may be directional, such that they can be used to provide directional information (e.g., an angle between the helmet and the audio source). Such directional information may then be used by the AEFS 29.100 to triangulate the position of the moped 29.110a. As another example, the AEFS 29.100 may measure differences between the arrival time of the audio signal 29.101 at multiple distinct microphones on the helmet 29.120a or other location. The difference in arrival time, together with information about the distance between the microphones, can be used by the AEFS 29.100 to determine distances between each of the microphones and the audio source, such as the moped 29.110a. Distances between the microphones and the audio source can then be used to determine one or more locations at which the audio source may be located.
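As a non-limiting sketch of the arrival-time approach, the following Python fragment estimates a bearing to an audio source from the difference in arrival time at two microphones, using the standard far-field approximation; the speed-of-sound constant and function name are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # meters/second, approximate value in air

def bearing_from_tdoa(delta_t: float, mic_spacing: float) -> float:
    """Estimate the angle (radians) of an audio source relative to the axis
    joining two microphones, using the far-field approximation
    sin(theta) = c * delta_t / d, where delta_t is the arrival-time
    difference (seconds) and d the spacing between the microphones (meters)."""
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delta_t / mic_spacing))
    return math.asin(ratio)
```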
Determining vehicular threat information may also or instead include obtaining information such as the position, trajectory, and speed of the user 29.104, such as by receiving data representing such information from sensors, devices, and/or systems on board the motorcycle 29.110b and/or the helmet 29.120a. Such sources of information may include a speedometer, a geo-location system (e.g., GPS system), an accelerometer, or the like. Once the AEFS 29.100 has determined and/or obtained information such as the position, trajectory, and speed of the moped 29.110a and the user 29.104, the AEFS 29.100 may determine whether the moped 29.110a and the user 29.104 are likely to collide with one another. For example, the AEFS 29.100 may model the expected trajectories of the moped 29.110a and user 29.104 to determine whether they intersect at or about the same point in time.
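The trajectory-intersection check described above might, for example, be approximated with a constant-velocity closest-point-of-approach computation such as the following Python sketch; the threshold values and names are illustrative assumptions, not requirements of any embodiment.

```python
import numpy as np

def closest_approach(p_user, v_user, p_other, v_other):
    """Given 2-D positions (m) and velocities (m/s), return
    (time_of_closest_approach_s, miss_distance_m) under a constant-velocity
    assumption."""
    dp = np.asarray(p_other, float) - np.asarray(p_user, float)
    dv = np.asarray(v_other, float) - np.asarray(v_user, float)
    speed_sq = float(dv @ dv)
    t_cpa = 0.0 if speed_sq == 0.0 else max(0.0, -float(dp @ dv) / speed_sq)
    miss = float(np.linalg.norm(dp + dv * t_cpa))
    return t_cpa, miss

def on_collision_course(p_user, v_user, p_other, v_other,
                        miss_threshold_m=2.0, horizon_s=10.0) -> bool:
    """Flag a likely collision if the paths pass close together soon."""
    t_cpa, miss = closest_approach(p_user, v_user, p_other, v_other)
    return t_cpa <= horizon_s and miss <= miss_threshold_m
```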
The AEFS 29.100 may then present the determined vehicular threat information (e.g., that the moped 29.110a represents a hazard) to the user 29.104 via the helmet 29.120a. Presenting the vehicular threat information may include transmitting the information to the helmet 29.120a, where it is received and presented to the user. In one embodiment, the helmet 29.120a includes audio speakers that may be used to output an audio signal (e.g., an alarm or voice message) warning the user 29.104. In other embodiments, the helmet 29.120a includes a visual display, such as a heads-up display presented upon a face screen of the helmet 29.120a, which can be used to present a text message (e.g., “Look left”) or an icon (e.g., a red arrow pointing left).
As noted, the AEFS 29.100 may also use information received from road-based sensors and/or devices. For example, the AEFS 29.100 may use information received from a camera 29.108 that is mounted on the traffic signal 29.106 that controls the illustrated intersection. The AEFS 29.100 may receive image data that represents the moped 29.110a and/or the motorcycle 29.110b. The AEFS 29.100 may perform image recognition to determine the type and/or position of a vehicle that is approaching the intersection. The AEFS 29.100 may also or instead analyze multiple images (e.g., from a video signal) to determine the velocity of a vehicle. Other types of sensors or devices installed in or about a roadway may also or instead be used, including range sensors, speed sensors (e.g., radar guns), induction coils (e.g., loops mounted in the roadbed), temperature sensors, weather gauges, or the like.
As noted above, the AEFS 29.100 may utilize data that represents a signal as detected by one or more detectors/sensors, such as microphones or cameras. In the example of
In an image context, the AEFS 29.100 may perform image processing on image data obtained from one or more of the camera sensors 29.124a and 29.124b. As discussed, the image data may be processed to determine the presence of the moped, its type, its motion-related information (e.g., velocity), and the like. In some embodiments, image data may be processed without making any definite identification of a vehicle. For example, the AEFS 29.100 may process image data from sensors 29.124a and 29.124b to identify the presence of motion (without necessarily identifying any objects). Based on such an analysis, the AEFS 29.100 may determine that there is something approaching from the left of the motorcycle 29.110b, but that the right of the motorcycle 29.110b is relatively clear.
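For illustration, one simple way to detect the presence and rough direction of motion without identifying objects is frame differencing, sketched below in Python under the assumption that grayscale frames are available as NumPy arrays; the threshold values are illustrative only.

```python
import numpy as np

def motion_fraction(prev_frame: np.ndarray, curr_frame: np.ndarray,
                    pixel_delta: int = 25) -> float:
    """Fraction of pixels whose grayscale intensity changed by more than
    pixel_delta between two frames; no object recognition is attempted."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(np.count_nonzero(diff > pixel_delta)) / diff.size

def motion_side(prev_frame: np.ndarray, curr_frame: np.ndarray) -> str:
    """Report 'left', 'right', or 'none' depending on which half of the
    frame shows substantially more changed pixels."""
    mid = prev_frame.shape[1] // 2
    left = motion_fraction(prev_frame[:, :mid], curr_frame[:, :mid])
    right = motion_fraction(prev_frame[:, mid:], curr_frame[:, mid:])
    if left > 2 * right and left > 0.01:
        return "left"
    if right > 2 * left and right > 0.01:
        return "right"
    return "none"
```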
Differences between data obtained from multiple sensors may be exploited in various ways. In an image context, an image signal may be perceived or captured differently by the two (camera) sensors 29.124a and 29.124b. The AEFS 29.100 may exploit or otherwise analyze such differences to determine the location and/or motion of the moped 29.110a. For example, knowing the relative position and optical qualities of the two cameras, it is possible to analyze images captured by those cameras to triangulate a position of an object (e.g., the moped 29.110a) or a distance between the motorcycle 29.110b and the object.
In an audio context, an audio signal may be perceived differently by the two sensors 29.124a and 29.124b. For example, if the strength of the signal 29.101 is stronger as measured at microphone 29.124a than at microphone 29.124b, the AEFS 29.100 may infer that the signal 29.101 is originating from the driver's left of the motorcycle 29.110b, and thus that a vehicle is approaching from that direction. As another example, because the strength of an audio signal is known to decay with distance, and assuming an initial level (e.g., based on an average signal level of a vehicle engine), the AEFS 29.100 may determine a distance (or distance interval) between one or more of the microphones and the signal source.
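As a rough illustration of the amplitude-decay idea, the following Python sketch converts a measured sound level into a range estimate assuming free-field inverse-distance decay and a nominal engine source level; both reference values are assumptions for illustration only.

```python
def distance_from_level(measured_db: float,
                        reference_db: float = 90.0,
                        reference_distance_m: float = 1.0) -> float:
    """Rough range estimate assuming free-field inverse-distance decay
    (about 6 dB per doubling of distance) and an assumed engine source level
    of reference_db at reference_distance_m; neither value is a measured
    property of any particular vehicle."""
    return reference_distance_m * 10 ** ((reference_db - measured_db) / 20.0)
```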
The AEFS 29.100 may model vehicles and other objects, such as by representing their motion-related information, including position, speed, acceleration, mass, and other properties. Such a model may then be used to determine whether objects are likely to collide. Note that the model may be probabilistic. For example, the AEFS 29.100 may represent an object's position in space as a region that includes multiple positions that each have a corresponding likelihood that the object is at that position. As another example, the AEFS 29.100 may represent the velocity of an object as a range of likely values, a probability distribution, or the like. Various frames of reference may be employed, including a user-centric frame, an absolute frame, or the like.
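One illustrative way to exercise such a probabilistic model is a Monte Carlo collision-probability estimate, sketched below in Python under constant-velocity and Gaussian-uncertainty assumptions; the sample count and thresholds are illustrative only.

```python
import numpy as np

def collision_probability(p_mean, p_cov, v_mean, v_cov,
                          p_user, v_user, horizon_s=5.0,
                          miss_threshold_m=2.0, samples=2000, seed=0):
    """Monte Carlo estimate of the chance that an uncertain object (Gaussian
    position and velocity) passes within miss_threshold_m of the user within
    horizon_s, assuming constant velocities."""
    rng = np.random.default_rng(seed)
    p_user = np.asarray(p_user, float)
    v_user = np.asarray(v_user, float)
    hits = 0
    for _ in range(samples):
        p = rng.multivariate_normal(p_mean, p_cov)
        v = rng.multivariate_normal(v_mean, v_cov)
        dp, dv = p - p_user, v - v_user
        speed_sq = float(dv @ dv)
        t = 0.0 if speed_sq == 0.0 else max(
            0.0, min(horizon_s, -float(dp @ dv) / speed_sq))
        if np.linalg.norm(dp + dv * t) <= miss_threshold_m:
            hits += 1
    return hits / samples
```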
The AEFS 29.100 may interact with various types of wearable devices 29.120, including a motorcycle helmet 29.120a (
In some embodiments, a wearable device may perform some or all of the functions of the AEFS 29.100, even though the AEFS 29.100 is depicted as separate in these examples. Some devices may have minimal processing power and thus perform only some of the functions. For example, the eyeglasses 29.120b may receive vehicular threat information from a remote AEFS 29.100 and display it on a heads-up display presented on the inside of the lenses of the eyeglasses 29.120b. Other wearable devices may have sufficient processing power to perform more of the functions of the AEFS 29.100. For example, the personal media device 29.120e may have considerable processing power and as such be configured to perform acoustic source localization, collision detection analysis, or other more computationally expensive functions.
Note that the wearable devices 29.120 may act in concert with one another or with other entities to perform functions of the AEFS 29.100. For example, the eyeglasses 29.120b may include a display mechanism that receives and displays vehicular threat information determined by the personal media device 29.120e. As another example, the goggles 29.120c may include a display mechanism that receives and displays vehicular threat information determined by a computing device in the helmet 29.120a or 29.120d. In a further example, one of the wearable devices 29.120 may receive and process audio data received by microphones mounted on the vehicle 29.110c.
The AEFS 29.100 may also or instead interact with vehicles 29.110 and/or computing devices installed thereon. As noted, a vehicle 29.110 may have one or more sensors or devices that may operate as (direct or indirect) sources of information for the AEFS 29.100. The vehicle 29.110c, for example, may include a speedometer, an accelerometer, one or more microphones, one or more range sensors, or the like. Data obtained by, at, or from such devices of vehicle 29.110c may be forwarded to the AEFS 29.100, possibly by a wearable device 29.120 of an operator of the vehicle 29.110c.
In some embodiments, the vehicle 29.110c may itself have or use an AEFS, and be configured to transmit warnings or other vehicular threat information to others. For example, an AEFS of the vehicle 29.110c may have determined that the moped 29.110a was driving with excessive speed just prior to the scenario depicted in
The AEFS 29.100 may also or instead interact with sensors and other devices that are installed on, in, or about roads or in other transportation related contexts, such as parking garages, racetracks, or the like. In this example, the AEFS 29.100 interacts with the camera 29.108 to obtain images of vehicles, pedestrians, or other objects present in a roadway. Other types of sensors or devices may include range sensors, infrared sensors, induction coils, radar guns, temperature gauges, precipitation gauges, or the like.
The AEFS 29.100 may further interact with information systems that are not shown in
In some embodiments, the AEFS 29.100 may transmit information to law enforcement agencies and/or related computing systems. For example, if the AEFS 29.100 determines that a vehicle is driving erratically, it may transmit that fact along with information about the vehicle (e.g., make, model, color, license plate number, location) to a police computing system.
Note that in some embodiments, at least some of the described techniques may be performed without the utilization of any wearable devices 29.120. For example, a vehicle 29.110 may itself include the necessary computation, input, and output devices to perform functions of the AEFS 29.100. For example, the AEFS 29.100 may present vehicular threat information on output devices of a vehicle 29.110, such as a radio speaker, dashboard warning light, heads-up display, or the like. As another example, a computing device on a vehicle 29.110 may itself determine the vehicular threat information.
In some embodiments, the AEFS 29.100 processes the image 29.140 to perform object identification. Upon processing the image 29.140, the AEFS 29.100 may identify the moped 29.110a, the child 29.141, the sun 29.142, the puddle 29.143, and/or the roadway 29.144. A sequence of images, taken at different times (e.g., one tenth of a second apart) may be used to determine that the moped 29.110a is moving, how fast the moped 29.110a is moving, acceleration/deceleration of the moped 29.110a, or the like. Motion of other objects, such as the child 29.141 may also be tracked. Based on such motion-related information, the AEFS 29.100 may model the physics of the identified objects to determine whether a collision is likely.
Determining vehicular threat information may also or instead be based on factors related or relevant to objects other than the moped 29.110a or the user 29.104. For example, the AEFS 29.100 may determine that the puddle 29.143 will likely make it more difficult for the moped 29.110a to stop. Thus, even if the moped 29.110a is moving at a reasonable speed, its operator may still be unable to stop prior to entering the intersection due to the presence of the puddle 29.143. As another example, the AEFS 29.100 may determine that evasive action by the user 29.104 and/or the moped 29.110a may cause injury to the child 29.141. As a further example, the AEFS 29.100 may determine that it may be difficult for the user 29.104 to see the moped 29.110a and/or the child 29.141 due to the position of the sun 29.142. Such information may be incorporated into any models, predictions, or determinations made or maintained by the AEFS 29.100.
The scenario of
In this example, the AEFS 29.100 determines that the driver of the motorcycle 29.110b intends to make a left turn. This determination may be based on the fact that the motorcycle 29.110b is slowing down or has activated its turn signals. In some embodiments, when the driver activates a turn signal, an indication of the activation is transmitted to the AEFS 29.100. The AEFS 29.100 then receives information (e.g., image data) about the moped 29.110a from the camera 29.108 and possibly one or more other sources (e.g., a camera, microphone, or other device on the motorcycle 29.110b; a device on the moped 29.110a; a road-embedded device). By analyzing the image data, the AEFS 29.100 can estimate the motion-related information (e.g., position, speed, acceleration) about the moped 29.110a. Based on this motion-related information, the AEFS 29.100 can determine threat information such as whether the moped 29.110a is slowing to stop or instead attempting to speed through the intersection. The AEFS 29.100 can then inform the user of the determined threat information, as discussed further with respect to
The display 29.150 may be used by embodiments of the AEFS to present threat information to users. For example, as discussed with respect to the scenario of
The display 29.150 may be provided in various ways. In one embodiment, the display 29.150 is presented by a heads-up display provided by a vehicle, such as the motorcycle 29.110b, a car, truck, or the like, where the display is presented on the wind screen or other surface. In another embodiment, the display 29.150 may be presented by a heads-up display provided by a wearable device, such as goggles or a helmet, where the display 29.150 is presented on a face or eye shield. In another embodiment, the display 29.150 may be presented by an LCD or similar screen in a dashboard or other portion of a vehicle.
The threat analysis engine 30.210 includes an audio processor 30.212, an image processor 30.214, other sensor data processors 30.216, and an object tracker 30.218. In the illustrated example, the audio processor 30.212 processes audio data received from the wearable device 29.120. As noted, such data may be received from other sources as well or instead, including directly from a vehicle-mounted microphone, or the like. The audio processor 30.212 may perform various types of signal processing, including audio level analysis, frequency analysis, acoustic source localization, or the like. Based on such signal processing, the audio processor 30.212 may determine the strength and direction of audio signals, audio source distance, audio source type, or the like. Outputs of the audio processor 30.212 (e.g., that an object is approaching from a particular angle) may be provided to the object tracker 30.218 and/or stored in the data store 30.240.
The image processor 30.214 receives and processes image data that may be received from sources such as the wearable device 29.120 and/or information sources 29.130. For example, the image processor 30.214 may receive image data from a camera of the wearable device 29.120, and perform object recognition to determine the type and/or position of a vehicle that is approaching the user 29.104. As another example, the image processor 30.214 may receive a video signal (e.g., a sequence or stream of images) and process them to determine the type, position, and/or velocity of a vehicle that is approaching the user 29.104. Multiple images may be processed to determine the presence or absence of motion, even if no object recognition is performed. Outputs of the image processor 30.214 (e.g., position and velocity information, vehicle type information) may be provided to the object tracker 30.218 and/or stored in the data store 30.240.
The other sensor data processor 30.216 receives and processes data received from other sensors or sources. For example, the other sensor data processor 30.216 may receive and/or determine information about the position and/or movements of the user and/or one or more vehicles, such as based on GPS systems, speedometers, accelerometers, or other devices. As another example, the other sensor data processor 30.216 may receive and process conditions information (e.g., temperature, precipitation) from the information sources 29.130 and determine that road conditions are currently icy. Outputs of the other sensor data processor 30.216 (e.g., that the user is moving at 5 miles per hour) may be provided to the object tracker 30.218 and/or stored in the data store 30.240.
The object tracker 30.218 manages a geospatial object model that includes information about objects known to the AEFS 29.100. The object tracker 30.218 receives and merges information about object types, positions, velocity, acceleration, direction of travel, and the like, from one or more of the processors 30.212, 30.214, 30.216, and/or other sources. Based on such information, the object tracker 30.218 may identify the presence of objects as well as their likely positions, paths, and the like. The object tracker 30.218 may continually update this model as new information becomes available and/or as time passes (e.g., by plotting a likely current position of an object based on its last measured position and trajectory). The object tracker 30.218 may also maintain confidence levels corresponding to elements of the geo-spatial model, such as a likelihood that a vehicle is at a particular position or moving at a particular velocity, that a particular object is a vehicle and not a pedestrian, or the like.
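For illustration only, a toy version of such a geospatial object model might merge observations and dead-reckon tracks forward as in the following Python sketch; the averaging weights and confidence updates are illustrative assumptions rather than a description of the object tracker 30.218.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackedObject:
    position: np.ndarray   # last estimated position (m)
    velocity: np.ndarray   # last estimated velocity (m/s)
    timestamp: float       # time of last update (s)
    confidence: float = 0.5  # crude belief that the track is real

class ObjectTracker:
    """Toy geospatial model: merge new observations by weighted averaging
    and dead-reckon tracks forward when queried between observations."""

    def __init__(self):
        self.tracks = {}

    def update(self, obj_id, position, velocity, timestamp, weight=0.5):
        pos, vel = np.asarray(position, float), np.asarray(velocity, float)
        track = self.tracks.get(obj_id)
        if track is None:
            self.tracks[obj_id] = TrackedObject(pos, vel, timestamp)
        else:
            track.position = (1 - weight) * track.position + weight * pos
            track.velocity = (1 - weight) * track.velocity + weight * vel
            track.timestamp = timestamp
            track.confidence = min(1.0, track.confidence + 0.1)

    def predict(self, obj_id, at_time):
        """Plot a likely current position from the last measured state."""
        track = self.tracks[obj_id]
        return track.position + track.velocity * (at_time - track.timestamp)
```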
The agent logic 30.220 implements the core intelligence of the AEFS 29.100. The agent logic 30.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to determine vehicular threat information. For example, the agent logic 30.220 may combine information from the object tracker 30.218, such as that there is a determined likelihood of a collision at an intersection, with information from one of the information sources 29.130, such as that the intersection is the scene of common red-light violations, and decide that the likelihood of a collision is high enough to transmit a warning to the user 29.104. As another example, the agent logic 30.220 may, in the face of multiple distinct threats to the user, determine which threat is the most significant and cause the user to avoid the more significant threat, such as by not directing the user 29.104 to slam on the brakes when a bicycle is approaching from the side but a truck is approaching from the rear, because being rear-ended by the truck would have more serious consequences than being hit from the side by the bicycle.
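As a simplified illustration of this kind of reasoning, the following Python sketch combines a model-derived collision likelihood with a contextual prior and ranks competing threats by expected harm; the noisy-OR combination and severity weights are illustrative assumptions, not the logic of any particular embodiment.

```python
def combined_collision_likelihood(model_likelihood: float,
                                  context_prior: float) -> float:
    """Blend the tracker's collision likelihood with a contextual prior,
    such as a known red-light-violation rate, via a simple noisy-OR."""
    return 1.0 - (1.0 - model_likelihood) * (1.0 - context_prior)

def most_significant_threat(threats):
    """threats: iterable of (label, likelihood, severity) tuples, where
    severity is a relative harm estimate (e.g., a truck from the rear is
    weighted far higher than a bicycle from the side).  Returns the label
    with the highest expected harm."""
    return max(threats, key=lambda t: t[1] * t[2])[0]

# e.g. most_significant_threat([("bicycle_side", 0.6, 1.0),
#                               ("truck_rear", 0.3, 10.0)]) -> "truck_rear"
```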
The presentation engine 30.230 includes a visible output processor 30.232 and an audible output processor 30.234. The visible output processor 30.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the wearable device 29.120 or some other display (e.g., a heads-up display of a vehicle 29.110 being driven by the user 29.104). The agent logic 30.220 may use or invoke the visible output processor 30.232 to prepare and display information, such as by formatting or otherwise modifying vehicular threat information to fit on a particular type or size of display. The audible output processor 30.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 30.220 may use or invoke the audible output processor 30.234 in order to convert a textual message (e.g., a warning message, a threat identification) into audio output suitable for presentation via the wearable device 29.120, for example by employing a text-to-speech processor.
Note that one or more of the illustrated components/modules may not be present in some embodiments. For example, in embodiments that do not perform image or video processing, the AEFS 29.100 may not include an image processor 30.214. As another example, in embodiments that do not perform audio output, the AEFS 29.100 may not include an audible output processor 30.234.
Note also that the AEFS 29.100 may act in service of multiple users 29.104. In some embodiments, the AEFS 29.100 may determine vehicular threat information concurrently for multiple distinct users. Such embodiments may further facilitate the sharing of vehicular threat information. For example, vehicular threat information determined as between two vehicles may be relevant and thus shared with a third vehicle that is in proximity to the other two vehicles.
FIGS. 31.1-31.132 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 31.101, the process performs at a road-based device, receiving information about a first vehicle that is proximate to the road-based device. The process may receive various types of information about the first vehicle, including image data, audio data, motion-related information, and the like, as discussed further below. This information is received at a road-based device, which is typically a fixed device situated on, in, or about a roadway traveled by the first vehicle. Example devices include cameras, microphones, induction loops, radar guns, range sensors (e.g., sonar, radar, LIDAR, IR-based), and the like. The device may be fixed (permanently or removably) to a structure, such as a utility pole, a traffic control signal, a building, or the like. In other embodiments, the road-based device may instead or also be a mobile device, such as may be situated in the first vehicle, on the user's person, on a trailer parked by the side of a road, or the like.
At block 31.102, the process performs determining threat information based at least in part on the information about the first vehicle. Threat information may include information related to threats posed by the first vehicle (e.g., to the user or to some other entity), by a vehicle occupied by the user (e.g., to the first vehicle or to some other entity), or the like. Note that threats may be posed by vehicles to non-vehicles, including pedestrians, animals, structures, or the like. Threats may also include those threats posed by non-vehicles (e.g., structures, pedestrians) to vehicles. Threat information may be determined in various ways. For example, where the received information is image data, the process may analyze the image data to identify objects, such as vehicles, pedestrians, fixed objects, and the like. In some embodiments, determining the threat information may also or instead include determining motion-related information about identified objects, including position, velocity, direction of travel, accelerations, or the like. In some embodiments, the received information is motion-related information that is transmitted by vehicles traveling about the roadway. Determining the threat information may also or instead include predicting whether the path of the user and one or more identified objects may intersect. These and other variations are discussed further below.
At block 31.103, the process performs presenting the threat information via a wearable device of a user. The determined threat information may be presented in various ways, such as by presenting an audible or visible warning or other indication that the first vehicle is approaching the user. Different types of wearable devices are contemplated, including helmets, eyeglasses, goggles, hats, and the like. In other embodiments, the threat information may also or instead be presented in other ways, such as via an output device on a vehicle of the user, in-situ output devices (e.g., traffic signs, road-side speakers), or the like. In some embodiments, the process may cause traffic control signals or devices to automatically change state, such as by changing a traffic light from green to red to inhibit cars from entering an intersection.
At block 31.201, the process performs determining a threat posed by the first vehicle to the user. As noted, the threat information may indicate a threat posed by the first vehicle to the user, such as that the first vehicle may collide with the user unless evasive action is taken.
At block 31.301, the process performs determining a threat posed by the first vehicle to some other entity besides the user. As noted, the threat information may indicate a threat posed by the first vehicle to some other person or thing, such as that the first vehicle may collide with the other entity. The other entity may be a vehicle occupied by the user, a vehicle not occupied by the user, a pedestrian, a structure, or any other object that may come into proximity with the first vehicle.
At block 31.401, the process performs determining a threat posed by a vehicle occupied by the user to the first vehicle. The threat information may indicate a threat posed by the user's vehicle (e.g., as a driver or passenger) to the first vehicle, such as because a collision may occur between the two vehicles. The vehicle occupied by the user may be the first vehicle or some other vehicle.
At block 31.501, the process performs determining a threat posed by a vehicle occupied by the user to some other entity besides the first vehicle. The threat information may indicate a threat posed by the user's vehicle to some other person or thing, such as due to a potential collision. The other entity may be some other vehicle, a pedestrian, a structure, or any other object that may come into proximity with the user's vehicle.
At block 31.601, the process performs determining a likelihood that the first vehicle will collide with some other object. In some embodiments, the process may determine a probability or other measure of the likelihood that the first vehicle will collide with some other object, such as another vehicle, a structure, a person, or the like. Such a determination may be made by reference to an object model that models the motions of objects in the roadway based on observations or other information gathered about such objects.
At block 31.701, the process performs determining a likelihood that the first vehicle will collide with the user. For example, the process may determine a probability that the first vehicle will collide with the user or a vehicle occupied by the user.
At block 31.801, the process performs determining that the likelihood that the first vehicle will collide with some other object is greater than a threshold. In some embodiments, the process compares the determined collision likelihood with a threshold. When the likelihood exceeds the threshold, particular actions may be taken, such as presenting a warning to the user or directing the user to take evasive action.
At block 31.901, the process performs determining that the first vehicle is driving erratically. The first vehicle may be driving erratically for a number of reasons, including due to a medical condition (e.g., a heart attack, bad eyesight, shortness of breath), drug/alcohol impairment, distractions (e.g., text messaging, crying children, loud music), or the like. Driving erratically may include driving too fast, too slow, not staying within traffic lanes, or the like.
At block 31.1001, the process performs determining that the first vehicle is driving with excessive speed. Excessive speed may be determined relatively, such as with respect to the average traffic speed on a road segment, posted speed limit, or the like. Similar techniques may be employed to determine if a vehicle is traveling too slowly.
At block 31.1101, the process performs determining that the first vehicle is traveling more than a threshold percentage faster than an average speed of traffic on a road segment. For example, a vehicle may be determined to be driving with excessive speed if the vehicle is driving more than 20% over a historical average speed for the road segment. Other thresholds (e.g., 10% over, 25% over) and/or baselines (e.g., average observed speed at a particular time of day) are contemplated.
At block 31.1201, the process performs determining that the first vehicle is traveling at a speed that is more than a threshold number of standard deviations over an average speed of traffic on a road segment. For example, a vehicle may be determined to be driving with excessive speed if the vehicle is driving more than one standard deviation over the historical average speed. Other baselines may be employed, including average speed for a particular time of day, average speed measured over a time window (e.g., 5 or 10 minutes) preceding the current time, or the like.
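For illustration, the two speed checks described in the two preceding blocks might be expressed as simple comparisons such as the following Python sketch; the default thresholds are illustrative only.

```python
def exceeds_percentage(speed: float, average_speed: float,
                       threshold_pct: float = 20.0) -> bool:
    """True if speed is more than threshold_pct percent over the average
    speed for the road segment."""
    return speed > average_speed * (1.0 + threshold_pct / 100.0)

def exceeds_std_devs(speed: float, average_speed: float, std_dev: float,
                     threshold_sigma: float = 1.0) -> bool:
    """True if speed is more than threshold_sigma standard deviations over
    the average speed for the road segment."""
    return speed > average_speed + threshold_sigma * std_dev
```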
At block 31.1501, the process performs receiving an image of the first vehicle from a camera deployed at an intersection. For example, the process may receive images from a camera that is fixed to a traffic light or other signal at an intersection near the first vehicle.
At block 31.1601, the process performs receiving ranging data from a range sensor deployed at an intersection, the ranging data representing a distance between the first vehicle and the intersection. For example, the process may receive a distance (e.g., 75 meters) measured between some known point in the intersection (e.g., the position of the range sensor) and an oncoming vehicle.
At block 31.2501, the process performs receiving motion-related information from the induction loop, the motion-related information including at least one of a position of the first vehicle, a velocity of the first vehicle, and/or a trajectory of the first vehicle. As noted, induction loops may be embedded in the roadway and configured to detect the presence of vehicles passing over them. Some types of loops and/or processing may be employed to detect other information, including velocity, vehicle size, and the like. Multiple induction loops may be configured to work in concert to measure, for example, vehicle velocity.
At block 31.2601, the process performs receiving the information about the first vehicle from a sensor attached to the first vehicle. The first vehicle may include one or more sensors that provide data to the process. For example, the first vehicle may include a camera, a microphone, a GPS receiver, or the like.
At block 31.2701, the process performs receiving the information about the first vehicle from a sensor attached to a second vehicle. The process may obtain information from some other vehicle that is not the first vehicle, such as a vehicle that is behind or in front of the first vehicle.
At block 31.3001, the process performs receiving the information about the first vehicle from a sensor attached to a vehicle that is occupied by the user. In some embodiments, the sensor is attached to a vehicle that is being driven or otherwise operated by the user.
At block 31.3101, the process performs receiving the information about the first vehicle from a sensor attached to a vehicle that is operating autonomously. In some embodiments, the sensor is attached to a vehicle that is operating autonomously, such as by utilizing a guidance or other control system to direct the operation of the vehicle.
At block 31.3201, the process performs receiving the information about the first vehicle from a sensor of the wearable device. The wearable device may include various devices, such as microphones, cameras, range sensors, or the like, that may provide data to the process.
At block 31.3301, the process performs receiving motion-related information about the first vehicle and/or other objects moving about a roadway. The motion-related information may include information about the mechanics (e.g., position, velocity, acceleration, mass) of the user and/or the first vehicle.
At block 31.3401, the process performs receiving position information from a position sensor of the first vehicle. In some embodiments, a GPS receiver, dead reckoning, or some combination thereof may be used to track the position of the first vehicle as it moves down the roadway.
At block 31.3501, the process performs receiving velocity information from a velocity sensor of the first vehicle. In some embodiments, the first vehicle periodically (or on request) transmits its velocity (e.g., as measured by its speedometer) to the process.
At block 31.3601, the process performs determining the threat information based on the motion-related information about the first vehicle. The process may also or instead consider a variety of motion-related information received from other sources, including the wearable device, some other vehicle, a fixed road-side sensor, or the like.
At block 31.3701, the process performs determining the threat information based on information about position, velocity, and/or acceleration of the user obtained from sensors in the wearable device. The wearable device may include position sensors (e.g., GPS), accelerometers, or other devices configured to provide motion-related information about the user to the process.
At block 31.3801, the process performs determining the threat information based on information about position, velocity, and/or acceleration of the user obtained from devices in a vehicle of the user. A vehicle occupied or operated by the user may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide motion-related information about the user to the process.
At block 31.3901, the process performs determining the threat information based on information about position, velocity, and/or acceleration of the first vehicle obtained from devices of the first vehicle. The first vehicle may include position sensors (e.g., GPS), accelerometers, speedometers, or other devices configured to provide motion-related information about the first vehicle to the process. In other embodiments, motion-related information may be obtained from other sources, such as a radar gun deployed at the side of a road, from other vehicles, or the like.
At block 31.4001, the process performs receiving image data from a camera, the image data representing an image of the first vehicle. The process may receive and consider image data, such as by performing image processing to identify vehicles or other hazards, to determine whether collisions may occur, determine motion-related information about the first vehicle (and possibly other entities), and the like. The image data may be obtained from various sources, including from a camera attached to the wearable device, a vehicle, a road-side structure, or the like.
At block 31.4101, the process performs receiving an image from a camera that is attached to one of a road-side structure, the first vehicle, a second vehicle, a vehicle occupied by the user, or the wearable device.
At block 31.4201, the process performs receiving video data that includes multiple images of the first vehicle taken at different times. In some embodiments, the image data comprises video data in compressed or raw form. The video data typically includes (or can be reconstructed or decompressed to derive) multiple sequential images taken at distinct times.
At block 31.4301, the process performs receiving a first image of the first vehicle taken at a first time.
At block 31.4302, the process performs receiving a second image of the first vehicle taken at a second time, wherein the first and second times are sufficiently different such that velocity and/or direction of travel of the first vehicle may be determined with respect to positions of the first vehicle shown in the first and second images. Various time intervals between images may be utilized. For example, it may not be necessary to receive video data having a high frame rate (e.g., 30 frames per second or higher), because it may be preferable to determine motion or other properties of the first vehicle based on images that are taken at larger time intervals (e.g., one tenth of a second, one quarter of a second). In some embodiments, transmission bandwidth may be saved by transmitting and receiving reduced frame rate image streams.
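As an illustration of this two-image approach, the following Python sketch derives speed and heading from two timestamped positions, assuming that pixel coordinates have already been mapped to meters elsewhere; the names and conventions are illustrative.

```python
import math

def velocity_from_two_fixes(p1, t1, p2, t2):
    """Given two (x, y) positions in meters derived from images taken at
    times t1 and t2 (seconds), return (speed_m_s, heading_deg).  Heading is
    measured clockwise from north."""
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("second image must be later than the first")
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    speed = math.hypot(dx, dy) / dt
    heading = math.degrees(math.atan2(dx, dy)) % 360.0
    return speed, heading
```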
At block 31.4401, the process performs identifying the first vehicle in the image data. Image processing techniques may be employed to identify the presence of a vehicle, its type (e.g., car or truck), its size, license plate number, color, or other identifying information about the first vehicle.
At block 31.4501, the process performs determining whether the first vehicle is moving towards the user based on multiple images represented by the image data. In some embodiments, a video feed or other sequence of images may be analyzed to determine the relative motion of the first vehicle. For example, if the first vehicle appears to be becoming larger over a sequence of images, then it is likely that the first vehicle is moving towards the user.
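For illustration, the growing-apparent-size heuristic might be approximated as in the following Python sketch, which operates on bounding-box areas extracted from successive frames; the growth threshold is an illustrative assumption.

```python
def appears_to_approach(box_areas, growth_threshold: float = 1.15) -> bool:
    """box_areas: apparent areas (pixels squared) of the vehicle's bounding
    box in successive frames.  Sustained growth in apparent size suggests
    the vehicle is moving towards the camera and hence the user."""
    if len(box_areas) < 2:
        return False
    growth = box_areas[-1] / box_areas[0]
    monotonic = all(b >= a for a, b in zip(box_areas, box_areas[1:]))
    return monotonic and growth >= growth_threshold
```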
At block 31.4601, the process performs determining motion-related information about the first vehicle, based on one or more images of the first vehicle. Motion-related information may include information about the mechanics (e.g., kinematics, dynamics) of the first vehicle, including position, velocity, direction of travel, acceleration, mass, or the like. Motion-related information may be determined for vehicles that are at rest. Motion-related information may be determined and expressed with respect to various frames of reference, including the user's frame of reference, the frame of reference of the first vehicle, a fixed frame of reference, or the like.
At block 31.4701, the process performs determining the motion-related information with respect to timestamps associated with the one or more images. In some embodiments, the received images include timestamps or other indicators that can be used to determine a time interval between the images. In other cases, the time interval may be known a priori or expressed in other ways, such as in terms of a frame rate associated with an image or video stream.
At block 31.4801, the process performs determining a position of the first vehicle. The position of the first vehicle may be expressed absolutely, such as via a GPS coordinate or similar representation, or relatively, such as with respect to the position of the user (e.g., 20 meters away from the user). In addition, the position of the first vehicle may be represented as a point or collection of points (e.g., a region, arc, or line).
At block 31.4901, the process performs determining a velocity of the first vehicle. The process may determine the velocity of the first vehicle in absolute or relative terms (e.g., with respect to the velocity of the user). The velocity may be expressed or represented as a magnitude (e.g., 10 meters per second), a vector (e.g., having a magnitude and a direction), or the like.
At block 31.5001, the process performs determining the velocity with respect to a fixed frame of reference. In some embodiments, a fixed, global, or absolute frame of reference may be utilized.
At block 31.5101, the process performs determining the velocity with respect to a frame of reference of the user. In some embodiments, velocity is expressed with respect to the user's frame of reference. In such cases, a stationary (e.g., parked) vehicle will appear to be approaching the user if the user is driving towards the first vehicle.
At block 31.5201, the process performs determining a direction of travel of the first vehicle. The process may determine a direction in which the first vehicle is traveling, such as with respect to the user and/or some absolute coordinate system or frame of reference.
At block 31.5301, the process performs determining acceleration of the first vehicle. In some embodiments, acceleration of the first vehicle may be determined, for example by determining a rate of change of the velocity of the first vehicle observed over time.
At block 31.5401, the process performs determining mass of the first vehicle. Mass of the first vehicle may be determined in various ways, including by identifying the type of the first vehicle (e.g., car, truck, motorcycle), determining the size of the first vehicle based on its appearance in an image, or the like.
At block 31.5501, the process performs identifying objects other than the first vehicle in the image data. Image processing techniques may be employed by the process to identify other objects of interest, including road hazards (e.g., utility poles, ditches, drop-offs), pedestrians, other vehicles, or the like.
At block 31.5601, the process performs determining driving conditions based on the image data. Image processing techniques may be employed by the process to determine driving conditions, such as surface conditions (e.g., icy, wet), lighting conditions (e.g., glare, darkness), or the like.
At block 31.5701, the process performs receiving data representing an audio signal emitted or reflected by the first vehicle. The data representing the audio signal may be raw audio samples, compressed audio data, frequency coefficients, or the like. The data representing the audio signal may represent the sound made by the first vehicle, such as from its engine, a horn, tires, or any other source of sound. The data may also or instead represent audio reflected by the vehicle, such as a sonar ping. In some embodiments, the data representing the audio signal may also or instead include sounds from other sources, including other vehicles, pedestrians, or the like.
At block 31.5801, the process performs receiving data obtained at a microphone array that includes multiple microphones. In some embodiments, a microphone array having two or more microphones is employed to receive audio signals. Differences between the received audio signals may be utilized to perform acoustic source localization or other functions, as discussed further herein.
At block 31.5901, the process performs receiving data obtained at a microphone array, the microphone array coupled to a road-side structure. The array may be fixed to a utility pole, a traffic signal, or the like. In other cases, the microphone array may be situated elsewhere, including on the first vehicle, some other vehicle, the wearable device, or the like.
At block 31.6001, the process performs determining the threat information based on the data representing the audio signal. As discussed further below, determining the threat information based on audio may include acoustic source localization, frequency analysis, or other techniques that can identify the presence, position, or motion of objects.
At block 31.6101, the process performs performing acoustic source localization to determine a position of the first vehicle based on multiple audio signals received via multiple microphones. The process may determine a position of the first vehicle by analyzing audio signals received via multiple distinct microphones. For example, engine noise of the first vehicle may have different characteristics (e.g., in volume, in time of arrival, in frequency) as received by different microphones. Differences between the audio signal measured at different microphones may be exploited to determine one or more positions (e.g., points, arcs, lines, regions) at which the first vehicle may be located.
At block 31.6201, the process performs receiving an audio signal via a first one of the multiple microphones, the audio signal representing a sound created by the first vehicle. In one approach, at least two microphones are employed. By measuring differences in the arrival time of an audio signal at the two microphones, the position of the first vehicle may be determined. The determined position may be a point, a line, an area, or the like.
At block 31.6202, the process performs receiving the audio signal via a second one of the multiple microphones.
At block 31.6203, the process performs determining the position of the first vehicle by determining a difference between an arrival time of the audio signal at the first microphone and an arrival time of the audio signal at the second microphone. In some embodiments, given information about the distance between the two microphones and the speed of sound, the process may determine the respective distances between each of the two microphones and the first vehicle. Given these two distances (along with the distance between the microphones), the process can solve for the one or more positions at which the first vehicle may be located.
At block 31.6301, the process performs triangulating the position of the first vehicle based on a first and second angle, the first angle measured between a first one of the multiple microphones and the first vehicle, the second angle measured between a second one of the multiple microphones and the first vehicle. In some embodiments, the microphones may be directional, in that they may be used to determine the direction from which the sound is coming. Given such information, the process may use triangulation techniques to determine the position of the first vehicle.
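As a non-limiting illustration of the triangulation described above, the following Python sketch intersects two bearing rays measured at directional microphones with known positions; the coordinate conventions (bearings in degrees clockwise from north) are illustrative assumptions.

```python
import math

def triangulate(mic1, bearing1_deg, mic2, bearing2_deg):
    """Intersect two bearing rays measured at two directional microphones
    with known (x, y) positions to estimate the source location.  Returns
    None if the bearings are (nearly) parallel."""
    d1 = (math.sin(math.radians(bearing1_deg)), math.cos(math.radians(bearing1_deg)))
    d2 = (math.sin(math.radians(bearing2_deg)), math.cos(math.radians(bearing2_deg)))
    denom = d1[0] * (-d2[1]) - d1[1] * (-d2[0])
    if abs(denom) < 1e-9:
        return None
    bx, by = mic2[0] - mic1[0], mic2[1] - mic1[1]
    t = (bx * (-d2[1]) - by * (-d2[0])) / denom
    return (mic1[0] + t * d1[0], mic1[1] + t * d1[1])

# e.g. triangulate((0, 0), 45.0, (10, 0), 315.0) -> approximately (5.0, 5.0)
```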
At block 31.6401, the process performs performing a Doppler analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the frequency of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. For example, if the frequency is shifting higher, the first vehicle may be determined to be approaching the user. Note that the determination is typically made from the frame of reference of the user (who may be moving or not). Thus, the first vehicle may be determined to be approaching the user when, as viewed from a fixed frame of reference, the user is approaching the first vehicle (e.g., a moving user traveling towards a stationary vehicle) or the first vehicle is approaching the user (e.g., a moving vehicle approaching a stationary user). In other embodiments, other frames of reference may be employed, such as a fixed frame, a frame associated with the first vehicle, or the like.
At block 31.6501, the process performs determining whether frequency of the audio signal is increasing or decreasing.
At block 31.6601, the process performs performing a volume analysis of the data representing the audio signal to determine whether the first vehicle is approaching the user. The process may analyze whether the volume (e.g., amplitude) of the audio signal is shifting in order to determine whether the first vehicle is approaching or departing the position of the user. As noted, different embodiments may use different frames of reference when making this determination.
At block 31.6701, the process performs determining whether volume of the audio signal is increasing or decreasing.
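The volume analysis may be sketched similarly, for illustration only, by comparing the root-mean-square amplitude of successive windows; real embodiments might smooth or normalize the levels.

```python
import numpy as np

def rms_levels(samples, window_size=4096):
    """Root-mean-square amplitude of successive windows of the signal."""
    samples = np.asarray(samples, float)
    return [float(np.sqrt(np.mean(samples[i:i + window_size] ** 2)))
            for i in range(0, len(samples) - window_size + 1, window_size)]

def appears_louder_over_time(samples, window_size=4096):
    """An increasing volume trend is one indication of an approaching source."""
    levels = rms_levels(samples, window_size)
    return levels[-1] > levels[0] if len(levels) >= 2 else None
```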
At block 31.6801, the process performs determining threat information that is not related to the first vehicle. The process may determine threat information that is not due to or otherwise related to the first vehicle, based on a variety of other factors or information, such as driving conditions, the presence or absence of other vehicles, the presence or absence of pedestrians, or the like.
At block 31.6901, the process performs receiving and processing information about objects and/or conditions aside from the first vehicle. At least some of the received information may include images of things other than the first vehicle, such as other vehicles, pedestrians, driving conditions, and the like.
At block 31.7001, the process performs receiving information about at least one of a stationary object, a pedestrian, and/or an animal. A stationary object may be a fence, guardrail, utility pole, building, parked vehicle, or the like.
At block 31.7101, the process performs processing the information about the first vehicle to determine the threat information that is not related to the first vehicle. For example, when the received information is image data, the process may determine that a difficult lighting condition exists due to glare or overexposure detected in the image data. As another example, the process may identify a pedestrian in the roadway depicted in the image data. As another example, the process may determine that poor road surface conditions exist, such as due to water or oil on the road surface.
At block 31.7201, the process performs processing information other than the information about the first vehicle to determine the threat information that is not related to the first vehicle. The process may analyze data other than the received information about the first vehicle, such as weather data (e.g., temperature, precipitation), time of day, traffic information, position or motion sensor information (e.g., obtained from GPS systems or accelerometers) related to other vehicles, or the like.
At block 31.7301, the process performs determining that poor driving conditions exist. Poor driving conditions may include or be based on weather information (e.g., snow, rain, ice, temperature), time information (e.g., night or day), lighting information (e.g., a light sensor indicating that the user is traveling towards the setting sun), or the like.
At block 31.7401, the process performs determining that adverse weather conditions exist. Adverse weather conditions may be determined based on weather information received from a weather information system or sensor, such as indications of the current temperature, precipitation, or the like.
At block 31.7501, the process performs determining that a road construction project is present in proximity to the user. The process may receive information from a traffic information system that identifies road segments upon which road construction is present.
At block 31.7601, the process performs determining that a limited visibility condition exists. Limited visibility may be due to the time of day (e.g., at dusk, dawn, or night), weather (e.g., fog, rain), or the like.
At block 31.7701, the process performs determining that there is slow traffic in proximity to the user. The process may receive and integrate information from traffic information systems (e.g., that report accidents), other vehicles (e.g., that are reporting their speeds), or the like.
At block 31.7801, the process performs receiving information from a traffic information system regarding traffic congestion on a road traveled by the user. Traffic information systems may provide fine-grained traffic information, such as current average speeds measured on road segments in proximity to the user.
At block 31.7901, the process performs determining that one or more vehicles are traveling slower than an average or posted speed for a road traveled by the user. Slow travel may be determined based on the speed of one or more vehicles with respect to various baselines, such as average observed speed (e.g., recorded over time, based on time of day, etc.), posted speed limits, recommended speeds based on conditions, or the like.
At block 31.8001, the process performs determining that poor surface conditions exist on a roadway traveled by the user. Poor surface conditions may be due to weather (e.g., ice, snow, rain), temperature, surface type (e.g., gravel road), foreign materials (e.g., oil), or the like.
At block 31.8101, the process performs determining that there is a pedestrian in proximity to the user. The presence of pedestrians may be determined in various ways. In some embodiments, the process may utilize image processing techniques to recognize pedestrians in received image data. In other embodiments pedestrians may wear devices that transmit their location and/or presence. In other embodiments, pedestrians may be detected based on their heat signature, such as by an infrared sensor on the wearable device, user vehicle, or the like.
At block 31.8201, the process performs determining that there is an accident in proximity to the user. Accidents may be identified based on traffic information systems that report accidents, vehicle-based systems that transmit when collisions have occurred, or the like.
At block 31.8301, the process performs determining that there is an animal in proximity to the user. The presence of an animal may be determined as discussed with respect to pedestrians, above.
At block 31.8401, the process performs determining the threat information based on gaze information associated with the user. In some embodiments, the process may consider the direction in which the user is looking when determining the threat information. For example, the threat information may depend on whether the user is or is not looking at the first vehicle, as discussed further below.
At block 31.8501, the process performs receiving an indication of a direction in which the user is looking. In some embodiments, an orientation sensor such as a gyroscope or accelerometer may be employed to determine the orientation of the user's head, face, or other body part. In some embodiments, a camera or other image sensing device may track the orientation of the user's eyes.
At block 31.8502, the process performs determining that the user is not looking towards the first vehicle. As noted, the process may track the position of the first vehicle. Given this information, coupled with information about the direction of the user's gaze, the process may determine whether or not the user is (or likely is) looking in the direction of the first vehicle.
At block 31.8503, the process performs in response to determining that the user is not looking towards the first vehicle, directing the user to look towards the first vehicle. When it is determined that the user is not looking at the first vehicle, the process may warn or otherwise direct the user to look in that direction, such as by saying or otherwise presenting “Look right!”, “Car on your left,” or similar message.
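By way of illustration only, such a gaze check might compare the user's gaze bearing with the bearing toward the tracked vehicle, as in the hypothetical sketch below; the 30-degree tolerance and the message strings are assumptions.

```python
import math

def bearing_to(user_pos, vehicle_pos):
    """Bearing (radians, counterclockwise from +x) from the user to the vehicle."""
    return math.atan2(vehicle_pos[1] - user_pos[1], vehicle_pos[0] - user_pos[0])

def gaze_directive(user_pos, gaze_bearing, vehicle_pos,
                   tolerance=math.radians(30)):
    """Return None if the user already appears to be looking at the vehicle;
    otherwise return a short directive toward it."""
    diff = bearing_to(user_pos, vehicle_pos) - gaze_bearing
    diff = (diff + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi]
    if abs(diff) <= tolerance:
        return None                      # user is looking toward the vehicle
    return "Look left!" if diff > 0 else "Look right!"
```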
At block 31.8601, the process performs identifying multiple threats to the user. The process may in some cases identify multiple potential threats, such as one car approaching the user from behind and another car approaching the user from the left.
At block 31.8602, the process performs identifying a first one of the multiple threats that is more significant than at least one other of the multiple threats. The process may rank, order, or otherwise evaluate the relative significance or risk presented by each of the identified threats. For example, the process may determine that a truck approaching from the right is a bigger risk than a bicycle approaching from behind. On the other hand, if the truck is moving very slowly (thus leaving more time for the truck and/or the user to avoid it) compared to the bicycle, the process may instead determine that the bicycle is the bigger risk.
At block 31.8603, the process performs instructing the user to avoid the first one of the multiple threats. Instructing the user may include outputting a command or suggestion to take (or not take) a particular course of action.
At block 31.8701, the process performs modeling multiple potential accidents that each correspond to one of the multiple threats to determine a collision force associated with each accident. In some embodiments, the process models the physics of various objects to determine potential collisions and possibly their severity and/or likelihood. For example, the process may determine an expected force of a collision based on factors such as object mass, velocity, acceleration, deceleration, or the like.
At block 31.8702, the process performs selecting the first threat based at least in part on which of the multiple accidents has the highest collision force. In some embodiments, the process considers the threat having the highest associated collision force when determining the most significant threat, because that threat will likely result in the greatest injury to the user.
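For illustration only, one very rough way to model a per-threat collision force is to estimate the change in momentum over an assumed impact duration (F = m * delta-v / delta-t) and rank threats accordingly; the masses, speeds, and impact times below are hypothetical values, not measured data.

```python
from dataclasses import dataclass

@dataclass
class Threat:
    label: str
    mass_kg: float            # estimated mass of the threatening object
    closing_speed_m_s: float  # closing speed relative to the user
    impact_time_s: float      # assumed duration over which the impact occurs

def collision_force(threat: Threat) -> float:
    """Rough impact estimate: F = m * (delta v) / (delta t)."""
    return threat.mass_kg * threat.closing_speed_m_s / threat.impact_time_s

def most_forceful(threats):
    """Select the threat whose modeled accident has the highest collision force."""
    return max(threats, key=collision_force)

# Hypothetical example: a slow truck versus a fast bicycle.
truck = Threat("truck approaching from the right", 9000.0, 2.0, 0.5)
bicycle = Threat("bicycle approaching from behind", 90.0, 8.0, 0.1)
print(most_forceful([truck, bicycle]).label)   # -> the truck, with these numbers
```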
At block 31.8801, the process performs determining a likelihood of an accident associated with each of the multiple threats. In some embodiments, the process associates a likelihood (probability) with each of the multiple threats. Such a probability may be determined with respect to a physical model that represents uncertainty with respect to the mechanics of the various objects that it models.
At block 31.8802, the process performs selecting the first threat based at least in part on which of the multiple threats has the highest associated likelihood. The process may consider the threat having the highest associated likelihood when determining the most significant threat.
At block 31.8901, the process performs determining a mass of an object associated with each of the multiple threats. In some embodiments, the process may consider the mass of threat objects, based on the assumption that those objects having higher mass (e.g., a truck) pose greater threats than those having a low mass (e.g., a pedestrian).
At block 31.8902, the process performs selecting the first threat based at least in part on which of the objects has the highest mass.
At block 31.9001, the process performs selecting the most significant threat from the multiple threats. Threat significance may be based on a variety of factors, including likelihood, cost, potential injury type, and the like.
At block 31.9101, the process performs determining that an evasive action with respect to the first vehicle poses a threat to some other object. The process may consider whether potential evasive actions pose threats to other objects. For example, the process may analyze whether directing the user to turn right would cause the user to collide with a pedestrian or some fixed object, which may actually result in a worse outcome (e.g., for the user and/or the pedestrian) than colliding with the first vehicle.
At block 31.9102, the process performs instructing the user to take some other evasive action that poses a lesser threat to the some other object. The process may rank or otherwise order evasive actions (e.g., slow down, turn left, turn right) based at least in part on the risks or threats those evasive actions pose to other entities.
At block 31.9201, the process performs identifying multiple threats that each have an associated likelihood and cost. In some embodiments, the process may perform a cost-minimization analysis, in which it considers multiple threats, including threats posed to the user and to others, and selects a threat that minimizes or reduces expected costs. The process may also consider threats posed by actions taken by the user to avoid other threats.
At block 31.9202, the process performs determining a course of action that minimizes an expected cost with respect to the multiple threats. Expected cost of a threat may be expressed as a product of the likelihood of damage associated with the threat and the cost associated with such damage.
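By way of illustration, such a cost-minimization can be sketched by scoring each candidate evasive action as the sum, over the identified threats, of likelihood times cost, and selecting the lowest-scoring action. All of the actions, likelihoods, and costs below are hypothetical.

```python
def expected_cost(action, threats):
    """Expected cost of taking `action`: sum over threats of
    P(damage | action) * cost(damage)."""
    return sum(t["likelihood"][action] * t["cost"] for t in threats)

def best_action(actions, threats):
    """Course of action that minimizes the total expected cost."""
    return min(actions, key=lambda a: expected_cost(a, threats))

# Hypothetical numbers: likelihood of a damaging collision under each action.
threats = [
    {"name": "oncoming car", "cost": 50_000,
     "likelihood": {"brake": 0.05, "swerve_left": 0.30, "continue": 0.60}},
    {"name": "pedestrian", "cost": 500_000,
     "likelihood": {"brake": 0.01, "swerve_left": 0.20, "continue": 0.01}},
]
actions = ["brake", "swerve_left", "continue"]
print(best_action(actions, threats))    # -> "brake" with these numbers
```

Restricting the sum to costs borne by the user, or summing costs across all affected persons and things, yields the user-centric and overall-cost variants described in the blocks that follow.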
At block 31.9401, the process performs identifying multiple threats that are each related to different persons or things. In some embodiments, the process considers risks related to multiple distinct entities, possibly including the user.
At block 31.9501, the process performs identifying multiple threats that are each related to the user. In some embodiments, the process also or only considers risks that are related to the user.
At block 31.9601, the process performs minimizing expected costs to the user posed by the multiple threats. In some embodiments, the process attempts to minimize those costs borne by the user. Note that this may cause the process to recommend a course of action that is not optimal from a societal perspective, such as by directing the user to drive his car over a pedestrian rather than to crash into a car or structure.
At block 31.9701, the process performs minimizing overall expected costs posed by the multiple threats, the overall expected costs being a sum of expected costs borne by the user and other persons/things. In some embodiments, the process attempts to minimize social costs, that is, the costs borne by the various parties to an accident. Note that this may cause the process to recommend a course of action that may have a high cost to the user (e.g., crashing into a wall and damaging the user's car) to spare an even higher cost to another person (e.g., killing a pedestrian).
At block 31.9801, the process performs presenting the threat information via an audio output device of the wearable device. The process may play an alarm, bell, chime, voice message, or the like that warns or otherwise informs the user of the threat information. The wearable device may include audio speakers operable to output audio signals, including as part of a set of earphones, earbuds, a headset, a helmet, or the like.
At block 31.9901, the process performs presenting the threat information via a visual display device of the wearable device. In some embodiments, the wearable device includes a display screen or other mechanism for presenting visual information. For example, when the wearable device is a helmet, a face shield of the helmet may be used as a type of heads-up display for presenting the threat information.
At block 31.10001, the process performs displaying an indicator that instructs the user to look towards the first vehicle. The displayed indicator may be textual (e.g., “Look right!”), iconic (e.g., an arrow), or the like.
At block 31.10101, the process performs displaying an indicator that instructs the user to accelerate, decelerate, and/or turn. An example indicator may be or include the text “Speed up,” “slow down,” “turn left,” or similar language.
At block 31.10201, the process performs directing the user to accelerate.
At block 31.10301, the process performs directing the user to decelerate.
At block 31.10401, the process performs directing the user to turn. In some embodiments, the process may provide “turn assistance,” by helping drivers better understand when it is appropriate to make a turn across one or more lanes of oncoming traffic. In such an embodiment, the process tracks vehicles as they approach an intersection to determine whether a vehicle waiting to turn across oncoming lanes of traffic has sufficient time to cross the lanes without colliding with the approaching vehicles.
At block 31.10501, the process performs directing the user not to turn. As noted, some embodiments provide a turn assistance feature for helping drivers make safe turns across lanes of oncoming traffic.
At block 31.10601, the process performs transmitting to the first vehicle a warning based on the threat information. The process may send or otherwise transmit a warning or other message to the first vehicle that instructs the operator of the first vehicle to take evasive action. The instruction to the first vehicle may be complementary to any instructions given to the user, such that if both instructions are followed, the risk of collision decreases. In this manner, the process may help avoid a situation in which the user and the operator of the first vehicle take actions that actually increase the risk of collision, such as may occur when the user and the first vehicle are approaching head-on but do not turn away from one another.
At block 31.10701, the process performs presenting the threat information via an output device of a vehicle of the user, the output device including a visual display and/or an audio speaker. In some embodiments, the process may use other devices to output the threat information, such as output devices of a vehicle of the user, including a car stereo, dashboard display, or the like.
At block 31.11101, the process performs presenting the threat information via goggles worn by the user. The goggles may include a small display, an audio speaker, a haptic output device, or the like.
At block 31.11201, the process performs presenting the threat information via a helmet worn by the user. The helmet may include an audio speaker or visual output device, such as a display that presents information on the inside of the face screen of the helmet. Other output devices, including haptic devices, are contemplated.
At block 31.11301, the process performs presenting the threat information via a hat worn by the user. The hat may include an audio speaker or similar output device.
At block 31.11401, the process performs presenting the threat information via eyeglasses worn by the user. The eyeglasses may include a small display, an audio speaker, a haptic output device, or the like.
At block 31.11501, the process performs presenting the threat information via audio speakers that are part of at least one of earphones, a headset, earbuds, and/or a hearing aid. The audio speakers may be integrated into the wearable device. In other embodiments, other audio speakers (e.g., of a car stereo) may be employed instead or in addition.
At block 31.11601, the process performs performing at the road-based device the determining threat information and/or the presenting the threat information. In some embodiments, the road-based device may be responsible for performing one or more of the operations of the process. For example, the road-based device may be or include a computing system situated at or about a street intersection configured to receive and analyze information about vehicles that are entering or nearing the intersection.
At block 31.11602, the process performs transmitting the threat information from the road-based device to the wearable device of the user. For example, when the road-based computing system determines that two vehicles may be on a collision course, the computing system can transmit threat information to the wearable device so that the user can take evasive action and avoid a possible accident.
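For illustration only, a road-based device might package such threat information into a small message before transmitting it to the wearable device. The field names and the use of JSON below are assumptions for the sketch, not a required message format.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ThreatMessage:
    """Hypothetical payload a road-based device might push to a wearable device."""
    source_id: str           # e.g., identifier of the intersection unit
    threat_type: str         # e.g., "collision_course", "icy_surface"
    bearing_deg: float       # direction of the threat relative to the recipient
    recommended_action: str  # e.g., "decelerate", "look_left"
    timestamp: float

def encode(msg: ThreatMessage) -> bytes:
    """Serialize for transmission over whatever link the devices share."""
    return json.dumps(asdict(msg)).encode("utf-8")

def decode(raw: bytes) -> ThreatMessage:
    return ThreatMessage(**json.loads(raw.decode("utf-8")))

warning = ThreatMessage("intersection-14", "collision_course",
                        bearing_deg=270.0, recommended_action="decelerate",
                        timestamp=time.time())
assert decode(encode(warning)) == warning
```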
At block 31.11701, the process performs performing on a computing system that is remote from the road-based device the determining threat information and/or the presenting the threat information. In some embodiments, a remote computing system may be responsible for performing one or more of the operations of the process. For example, the road-based device may forward the received information to a cloud-based computing system where it is analyzed to determine the threat information.
At block 31.11702, the process performs transmitting the threat information from the road-based device to the wearable device of the user. The cloud-based computing system can transmit threat information to the wearable device so that the user can take evasive action and avoid a possible accident.
At block 31.11801, the process performs receiving data representing threat information relevant to a second vehicle, the second vehicle not being used for travel by the user. As noted, threat information may in some embodiments be shared amongst vehicles, entities, devices, or systems present in a roadway. For example, a second vehicle may have stalled in an intersection that is being approached by the user. This second vehicle may then transmit the fact that it has stalled to the process, which in turn forwards an instruction to slow down to the user. As another example, the second vehicle may transmit an indication of an icy surface condition, which is then forwarded by the process to the user.
At block 31.11802, the process performs determining the threat information based on the data representing threat information relevant to the second vehicle. Having received threat information from the second vehicle, the process may determine that it is also relevant to the user, and then accordingly present it to the user.
At block 31.11901, the process performs receiving from the second vehicle an indication of stalled or slow traffic encountered by the second vehicle. Various types of threat information relevant to the second vehicle may be provided to the process, such as that there is stalled or slow traffic ahead of the second vehicle.
At block 31.12001, the process performs receiving from the second vehicle an indication of poor driving conditions experienced by the second vehicle. The second vehicle may share the fact that it is experiencing poor driving conditions, such as an icy or wet roadway.
At block 31.12101, the process performs receiving from the second vehicle an indication that the first vehicle is driving erratically. The second vehicle may share a determination that the first vehicle is driving erratically, such as by swerving, driving with excessive speed, driving too slowly, or the like.
At block 31.12201, the process performs receiving from the second vehicle an image of the first vehicle. The second vehicle may include one or more cameras, and may share images obtained via those cameras with other entities.
At block 31.12301, the process performs transmitting the threat information to a second vehicle. As noted, threat information may in some embodiments be shared amongst vehicles, entities, devices, or systems present in a roadway. In this example, the threat information is transmitted to a second vehicle (e.g., one following behind the user), so that the second vehicle may benefit from the determined threat information as well.
At block 31.12401, the process performs transmitting the threat information to an intermediary server system for distribution to other vehicles in proximity to the user. In some embodiments, intermediary systems may operate as relays for sharing the threat information with other vehicles and users of a roadway.
At block 31.12501, the process performs transmitting the threat information to a second road-based device situated along a projected course of travel of the first vehicle. For example, the process may transmit the threat information to a second road-based device located at a next intersection or otherwise further along a roadway, so that the second road-based device can take appropriate action, such as warning other vehicles, pedestrians, or the like.
At block 31.12601, the process performs causing the second road-based device to warn drivers that the first vehicle is driving erratically.
At block 31.12701, the process performs causing the second road-based device to control a traffic control signal to inhibit a collision involving the first vehicle. For example, the second road-based device may change a signal from green to red in order to stop other vehicles from entering an intersection when it is determined that the first vehicle is running red lights.
At block 31.12801, the process performs transmitting the threat information to a law enforcement entity. In some embodiments, the process shares the threat information with law enforcement entities, including computer or other information systems managed or operated by such entities. For example, if the process determines that the first vehicle is driving erratically, the process may transmit that determination and/or information about the first vehicle to the police.
At block 31.12901, the process performs determining a license plate identifier of the first vehicle based on the image data. The process may perform image processing (e.g., optical character recognition) to determine the license number on the license plate of the first vehicle.
At block 31.12902, the process performs transmitting the license plate identifier to the law enforcement entity.
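By way of illustration only, one way such optical character recognition could be approximated is sketched below using the open-source OpenCV and Tesseract libraries; these libraries, and the preprocessing steps shown, are assumptions made for the example and are not part of the described embodiments. A deployed system would typically first locate and rectify the plate region in the image.

```python
import cv2                     # OpenCV; assumed to be installed
import pytesseract             # Tesseract OCR bindings; also an assumption

def read_plate(image_bgr):
    """Crude plate read: grayscale, threshold, then OCR restricted to the
    characters typically found on a license plate."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(
        binary,
        config="--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
    return "".join(ch for ch in text if ch.isalnum())
```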
At block 31.13001, the process performs determining a vehicle description of the first vehicle based on the image data. Image processing may be utilized to determine a vehicle description, including one or more of type, make, year, and/or color of the first vehicle.
At block 31.13002, the process performs transmitting the vehicle description to the law enforcement entity.
At block 31.13101, the process performs determining a location associated with the first vehicle. The process may reference a GPS system to determine the current location of the user and/or the first vehicle, and then provide an indication of that location to the police or other agency. The location may be or include a coordinate, a street or intersection name, a name of a municipality, or the like.
At block 31.13102, the process performs transmitting an indication of the location to the law enforcement entity.
At block 31.13201, the process performs determining a direction of travel of the first vehicle. As discussed above, the process may determine direction of travel in various ways, such as by modeling the motion of the first vehicle. Such a direction may then be provided to the police or other agency, such as by reporting that the first vehicle is traveling northbound.
At block 31.13202, the process performs transmitting an indication of the direction of travel to the law enforcement entity.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 29.100. In addition, the computing system 32.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 29.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 32.400 comprises a computer memory (“memory”) 32.401, a display 32.402, one or more Central Processing Units (“CPU”) 32.403, Input/Output devices 32.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 32.405, and network connections 32.406. The AEFS 29.100 is shown residing in memory 32.401. In other embodiments, some portion of the contents and/or some or all of the components of the AEFS 29.100 may be stored on and/or transmitted over the other computer-readable media 32.405. The components of the AEFS 29.100 preferably execute on one or more CPUs 32.403 and implement techniques described herein. Other code or programs 32.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 32.420, also reside in the memory 32.401, and preferably execute on one or more CPUs 32.403.
The AEFS 29.100 interacts via the network 32.450 with wearable devices 29.120, information sources 29.130, and third-party systems/applications 32.455. The network 32.450 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The third-party systems/applications 32.455 may include any systems that provide data to, or utilize data from, the AEFS 29.100, including Web browsers, vehicle-based client systems, traffic tracking, monitoring, or prediction systems, and the like.
The AEFS 29.100 is shown executing in the memory 32.401 of the computing system 32.400. Also included in the memory are a user interface manager 32.415 and an application program interface (“API”) 32.416. The user interface manager 32.415 and the API 32.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 29.100.
The UI manager 32.415 provides a view and a controller that facilitate user interaction with the AEFS 29.100 and its various components. For example, the UI manager 32.415 may provide interactive access to the AEFS 29.100, such that users can configure the operation of the AEFS 29.100, such as by providing the AEFS 29.100 with information about common routes traveled, vehicle types used, driving patterns, or the like. The UI manager 32.415 may also manage and/or implement various output abstractions, such that the AEFS 29.100 can cause vehicular threat information to be displayed on different media, devices, or systems. In some embodiments, access to the functionality of the UI manager 32.415 may be provided via a Web server, possibly executing as one of the other programs 32.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 32.455 can interact with the AEFS 29.100 via the UI manager 32.415.
The API 32.416 provides programmatic access to one or more functions of the AEFS 29.100. For example, the API 32.416 may provide a programmatic interface to one or more functions of the AEFS 29.100 that may be invoked by one of the other programs 32.430 or some other module. In this manner, the API 32.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 29.100 into vehicle-based client systems or devices), and the like.
In addition, the API 32.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the wearable devices 29.120, information sources 29.130, and/or one of the third-party systems/applications 32.455, to access various functions of the AEFS 29.100. For example, an information source 29.130 such as a radar gun installed at an intersection may push motion-related information (e.g., velocity) about vehicles to the AEFS 29.100 via the API 32.416. As another example, a weather information system may push current conditions information (e.g., temperature, precipitation) to the AEFS 29.100 via the API 32.416. The API 32.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 32.455 and that are configured to interact with the AEFS 29.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 29.100 are implemented using standard programming techniques. For example, the AEFS 29.100 may be implemented as a “native” executable running on the CPU 32.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 29.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 32.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the AEFS 29.100, such as in the data store 32.420 (or 30.240), can be made available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through markup languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 32.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the AEFS 29.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
Embodiments described herein provide enhanced computer- and network-based methods and systems for ability enhancement and, more particularly, for enhancing a user's ability to operate or function in a transportation-related context (e.g., as a pedestrian or vehicle operator) by performing threat detection based at least in part on analyzing information received from road-based devices, such as a camera, microphone, or other sensor deployed at the side of a road, at an intersection, or other road-based location. The received information may include image data, audio data, or other data/signals that represent vehicles and other objects or conditions present in a roadway or other context. Example embodiments provide an Ability Enhancement Facilitator System (“AEFS”) that performs at least some of the described techniques. Embodiments of the AEFS may augment, enhance, or improve the senses (e.g., hearing), faculties (e.g., memory, language comprehension), and/or other abilities (e.g., driving, riding a bike, walking/running) of a user.
In some embodiments, the AEFS is configured to identify threats (e.g., posed by vehicles to a user of a roadway, posed by a user to vehicles or other users of a roadway), and to provide information about such threats to a user so that he may take evasive action. Identifying threats may include analyzing information about a vehicle that is present in the roadway in order to determine whether the user and the vehicle may be on a collision course. The analyzed information may include or be represented by image data (e.g., pictures or video of a roadway and its surrounding environment), audio data (e.g., sounds reflected from or emitted by a vehicle), range information (e.g., provided by a sonar or infrared range sensor), conditions information (e.g., weather, temperature, time of day), or the like. The user may be a pedestrian (e.g., a walker, a jogger), an operator of a motorized (e.g., car, motorcycle, moped, scooter) or non-motorized vehicle (e.g., bicycle, pedicab, rickshaw), a vehicle passenger, or the like. In some embodiments, the vehicle may be operating autonomously. In some embodiments, the user wears a wearable device (e.g., a helmet, goggles, eyeglasses, hat) that is configured to at least present determined vehicular threat information to the user.
The AEFS may determine threats based on information received from various sources. Road-based sources may provide image, audio, or other types of data to the AEFS. The road-based sources may include sensors, devices, or systems that are deployed at, within, or about a roadway or intersection. For example, cameras, microphones, range sensors, velocity sensors, and the like may be affixed to utility or traffic signal support structures (e.g., poles, posts). As another example, induction coils embedded within a road can provide information to the AEFS about the presence and/or velocity of vehicles traveling over the road.
In some embodiments, the AEFS is configured to receive image data, at least some of which represents an image of a first vehicle. The image data may be obtained from various sources, including a camera of a wearable device of a user, a camera on a vehicle of the user, a road-side camera, a camera on some other vehicle, or the like. The image data may represent electromagnetic signals of various types or in various ranges, including visual signals (e.g., signals having a wavelength in the range of about 390-750 nm), infrared signals (e.g., signals having a wavelength in the range of about 750 nm-300 micrometers), or the like.
Then, the AEFS determines vehicular threat information based at least in part on the image data. In some embodiments, the AEFS may analyze the received image data in order to identify the first vehicle and/or to determine whether the first vehicle represents a threat to the user, such as because the first vehicle and the user may be on a collision course. The image data may be analyzed in various ways, including by identifying objects (e.g., to recognize that a vehicle or some other object is shown in the image data), determining motion-related information (e.g., position, velocity, acceleration, mass) about objects, or the like.
Next, the AEFS informs the user of the determined vehicular threat information via a wearable device of the user. Typically, the user's wearable device (e.g., a helmet) will include one or more output devices, such as audio speakers, visual display devices (e.g., warning lights, screens, heads-up displays), haptic devices, and the like. The AEFS may present the vehicular threat information via one or more of these output devices. For example, the AEFS may visually display or speak the words “Car on left.” As another example, the AEFS may visually display a leftward pointing arrow on a heads-up screen displayed on a face screen of the user's helmet. Presenting the vehicular threat information may also or instead include presenting a recommended course of action (e.g., to slow down, to speed up, to turn) to mitigate the determined vehicular threat.
The AEFS may use other or additional sources or types of information. For example, in some embodiments, the AEFS is configured to receive data representing an audio signal emitted by a first vehicle. The audio signal is typically obtained in proximity to a user, who may be a pedestrian or traveling in a vehicle as an operator or a passenger. In some embodiments, the audio signal is obtained by one or more microphones coupled to a road-side structure, the user's vehicle and/or a wearable device of the user, such as a helmet, goggles, a hat, a media player, or the like. Then, the AEFS may determine vehicular threat information based at least in part on the data representing the audio signal. In some embodiments, the AEFS may analyze the received data in order to determine whether the first vehicle and the user are on a collision course. The audio data may be analyzed in various ways, including by performing audio analysis, frequency analysis (e.g., Doppler analysis), acoustic localization, or the like.
The AEFS may combine information of various types in order to determine threat information. For example, because image processing may be computationally expensive, rather than always processing all image data obtained from every possible source, the AEFS may use audio analysis to initially determine the approximate location of an oncoming vehicle, such as to the user's left, right, or rear. For example, having determined based on audio data that a vehicle may be approaching from the rear of the user, the AEFS may preferentially process image data from a rear-facing camera to further refine a threat analysis. As another example, the AEFS may incorporate information about the condition of a roadway (e.g., icy or wet) when determining whether a vehicle will be able to stop or maneuver in order to avoid an accident.
In some embodiments, an AEFS may utilize threat information received from other sources, including another AEFS. In particular, in some embodiments, vehicles and devices present in a transportation network may share threat information with one another in order to enhance the abilities of users of the transportation network. In this manner, increased processing power and enhanced responsiveness may be obtained from a network of devices operating in concert with one another.
In one embodiment, a first vehicle receives threat information from a remote device. The remote device may be or execute an AEFS, and may have a fixed (e.g., as a road-based device) or mobile (e.g., in another vehicle, worn by a pedestrian) position. The remote device may itself receive or utilize information from other devices, such as sensors (e.g., cameras, microphones, induction loops) or other computing devices that possibly execute another AEFS or some other system for determining threats.
The received threat information is typically based on information about objects or conditions proximate to the remote device. For example, where the remote device is a computing system located at an intersection, the computing system may process data received from various sensors that are deployed at or about the intersection. The information about objects or conditions may be or include image data, audio data, weather data, motion-related information, or the like.
In some embodiments, when a vehicle receives threat information, it determines whether the threat information is relevant to the safe operation of the vehicle. For example, the vehicle may receive threat information from an intersection-based device that is behind (and receding from) the vehicle. This information is likely not relevant to the vehicle, because the vehicle has already passed through the intersection. As another example, the vehicle may receive an indication of an icy road surface from a device that is ahead of the vehicle. This information is likely to be relevant, because the vehicle is approaching the location of the icy surface. Relevance may generally be determined based on various factors, including location, direction of travel, speed (e.g., an icy surface may not be relevant if the vehicle is moving very slowly), operator skill, or the like.
When a vehicle determines that received threat information is relevant, it may modify operation of the vehicle. Modifying vehicle operation may include presenting a message (e.g., a warning, an instruction) to the vehicle operator with regard to the threat. Modifying vehicle operation may also or instead include controlling the vehicle itself, such as by causing the vehicle to accelerate, decelerate, or turn.
In this example, the moped 33.110a is driving towards the motorcycle 33.110b from a side street, at approximately a right angle with respect to the path of travel of the motorcycle 33.110b. The traffic signal 33.106 has just turned from red to green for the motorcycle 33.110b, and the user 33.104 is beginning to drive the motorcycle 33.110b into the intersection controlled by the traffic signal 33.106. The user 33.104 is assuming that the moped 33.110a will stop, because cross traffic will have a red light. However, in this example, the moped 33.110a may not stop in a timely manner, for one or more reasons, such as because the operator of the moped 33.110a has not seen the red light, because the moped 33.110a is moving at an excessive rate, because the operator of the moped 33.110a is impaired, because the surface conditions of the roadway are icy or slick, or the like. As will be discussed further below, the AEFS 33.100 will determine that the moped 33.110a and the motorcycle 33.110b are likely on a collision course, and inform the user 33.104 of this threat via the helmet 33.120a, so that the user may take evasive action to avoid a possible collision with the moped 33.110a.
The moped 33.110a emits or reflects a signal 33.101. In some embodiments, the signal 33.101 is an electromagnetic signal in the visible light spectrum that represents an image of the moped 33.110a. Other types of electromagnetic signals may be received and processed, including infrared radiation, radio waves, microwaves, or the like. Other types of signals are contemplated, including audio signals, such as an emitted engine noise, a reflected sonar signal, a vocalization (e.g., shout, scream), etc. The signal 33.101 may be received by a receiving detector/device/sensor, such as a camera or microphone (not shown) on the helmet 33.120a and/or the motorcycle 33.110b. In some embodiments, a computing and communication device within the helmet 33.120a receives and samples the signal 33.101 and transmits the samples or other representation to the AEFS 33.100. In other embodiments, other forms of data may be used to represent the signal 33.101, including frequency coefficients, compressed audio/video, or the like.
The AEFS 33.100 determines vehicular threat information by analyzing the received data that represents the signal 33.101. If the signal 33.101 is a visual signal, then the AEFS 33.100 may employ various image data processing techniques. For example, the AEFS 33.100 may perform object recognition to determine that received image data includes an image of a vehicle, such as the moped 33.110a. The AEFS 33.100 may also or instead process received image data to determine motion-related information with respect to the moped 33.110a, including position, velocity, acceleration, or the like. The AEFS 33.100 may further identify the presence of other objects, including pedestrians, animals, structures, or the like, that may pose a threat to the user 33.104 or that may be themselves threatened (e.g., by actions of the user 33.104 and/or the moped 33.110a). Image processing also may be employed to determine other information, including road conditions (e.g., wet or icy roads), visibility conditions (e.g., glare or darkness), and the like.
If the signal 33.101 is an audio signal, then the AEFS 33.100 may use one or more audio analysis techniques to determine the vehicular threat information. In one embodiment, the AEFS 33.100 performs a Doppler analysis (e.g., by determining whether the frequency of the audio signal is increasing or decreasing) to determine that the object that is emitting the audio signal is approaching (and possibly at what rate) the user 33.104. In some embodiments, the AEFS 33.100 may determine the type of vehicle (e.g., a heavy truck, a passenger vehicle, a motorcycle, a moped) by analyzing the received data to identify an audio signature that is correlated with a particular engine type or size. For example, a lower frequency engine sound may be correlated with a larger vehicle size, and a higher frequency engine sound may be correlated with a smaller vehicle size.
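For illustration only, such an audio-signature check might bin the dominant engine frequency into coarse vehicle classes, as sketched below; the frequency bands are illustrative guesses rather than measured correlations, and a deployed system might instead use trained models.

```python
import numpy as np

def classify_engine(samples, sample_rate):
    """Toy audio-signature classifier that maps the dominant engine frequency
    to a coarse vehicle class (band boundaries are illustrative only)."""
    samples = np.asarray(samples, float) * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    peak_hz = float(freqs[np.argmax(spectrum)])
    if peak_hz < 120:
        return "heavy truck"
    if peak_hz < 300:
        return "passenger vehicle"
    return "moped or motorcycle"
```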
In one embodiment, where the signal 33.101 is an audio signal, the AEFS 33.100 performs acoustic source localization to determine information about the trajectory of the moped 33.110a, including one or more of position, direction of travel, speed, acceleration, or the like. Acoustic source localization may include receiving data representing the audio signal 33.101 as measured by two or more microphones. For example, the helmet 33.120a may include four microphones (e.g., front, right, rear, and left) that each receive the audio signal 33.101. These microphones may be directional, such that they can be used to provide directional information (e.g., an angle between the helmet and the audio source). Such directional information may then be used by the AEFS 33.100 to triangulate the position of the moped 33.110a. As another example, the AEFS 33.100 may measure differences between the arrival time of the audio signal 33.101 at multiple distinct microphones on the helmet 33.120a or other location. The difference in arrival time, together with information about the distance between the microphones, can be used by the AEFS 33.100 to determine distances between each of the microphones and the audio source, such as the moped 33.110a. Distances between the microphones and the audio source can then be used to determine one or more locations at which the audio source may be located.
Determining vehicular threat information may also or instead include obtaining information such as the position, trajectory, and speed of the user 33.104, such as by receiving data representing such information from sensors, devices, and/or systems on board the motorcycle 33.110b and/or the helmet 33.120a. Such sources of information may include a speedometer, a geo-location system (e.g., GPS system), an accelerometer, or the like. Once the AEFS 33.100 has determined and/or obtained information such as the position, trajectory, and speed of the moped 33.110a and the user 33.104, the AEFS 33.100 may determine whether the moped 33.110a and the user 33.104 are likely to collide with one another. For example, the AEFS 33.100 may model the expected trajectories of the moped 33.110a and user 33.104 to determine whether they intersect at or about the same point in time.
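By way of illustration, such a collision-course check might step both trajectories forward under a constant-velocity assumption and test whether the two paths pass within a small separation at about the same time; the time horizon, step size, and separation radius below are hypothetical parameters.

```python
import numpy as np

def on_collision_course(pos_a, vel_a, pos_b, vel_b,
                        horizon_s=10.0, step_s=0.1, radius_m=2.0):
    """Project both objects forward at constant velocity and report whether
    (and roughly when) they come within `radius_m` of each other."""
    pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
    pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)
    for t in np.arange(0.0, horizon_s, step_s):
        gap = np.linalg.norm((pos_a + vel_a * t) - (pos_b + vel_b * t))
        if gap < radius_m:
            return True, float(t)
    return False, None

# Hypothetical example: moped heading east, motorcycle heading north,
# both nearing the same point in the intersection.
print(on_collision_course([-30, 0], [6, 0], [0, -25], [0, 5]))  # -> (True, ~4.8)
```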
The AEFS 33.100 may then present the determined vehicular threat information (e.g., that the moped 33.110a represents a hazard) to the user 33.104 via the helmet 33.120a. Presenting the vehicular threat information may include transmitting the information to the helmet 33.120a, where it is received and presented to the user. In one embodiment, the helmet 33.120a includes audio speakers that may be used to output an audio signal (e.g., an alarm or voice message) warning the user 33.104. In other embodiments, the helmet 33.120a includes a visual display, such as a heads-up display presented upon a face screen of the helmet 33.120a, which can be used to present a text message (e.g., “Look left”) or an icon (e.g., a red arrow pointing left).
As noted, the AEFS 33.100 may also use information received from road-based sensors and/or devices. For example, the AEFS 33.100 may use information received from a camera 33.108 that is mounted on the traffic signal 33.106 that controls the illustrated intersection. The AEFS 33.100 may receive image data that represents the moped 33.110a and/or the motorcycle 33.110b. The AEFS 33.100 may perform image recognition to determine the type and/or position of a vehicle that is approaching the intersection. The AEFS 33.100 may also or instead analyze multiple images (e.g., from a video signal) to determine the velocity of a vehicle. Other types of sensors or devices installed in or about a roadway may also or instead be used, including range sensors, speed sensors (e.g., radar guns), induction coils (e.g., loops mounted in the roadbed), temperature sensors, weather gauges, or the like.
As noted above, the AEFS 33.100 may utilize data that represents a signal as detected by one or more detectors/sensors, such as microphones or cameras. In the example discussed here, two such sensors 33.124a and 33.124b (e.g., cameras or microphones) are mounted at different positions on the motorcycle 33.110b.
In an image context, the AEFS 33.100 may perform image processing on image data obtained from one or more of the camera sensors 33.124a and 33.124b. As discussed, the image data may be processed to determine the presence of the moped, its type, its motion-related information (e.g., velocity), and the like. In some embodiments, image data may be processed without making any definite identification of a vehicle. For example, the AEFS 33.100 may process image data from sensors 33.124a and 33.124b to identify the presence of motion (without necessarily identifying any objects). Based on such an analysis, the AEFS 33.100 may determine that there is something approaching from the left of the motorcycle 33.110b, but that the right of the motorcycle 33.110b is relatively clear.
Differences between data obtained from multiple sensors may be exploited in various ways. In an image context, an image signal may be perceived or captured differently by the two (camera) sensors 33.124a and 33.124b. The AEFS 33.100 may exploit or otherwise analyze such differences to determine the location and/or motion of the moped 33.110a. For example, knowing the relative position and optical qualities of the two cameras, it is possible to analyze images captured by those cameras to triangulate a position of an object (e.g., the moped 33.110a) or a distance between the motorcycle 33.110b and the object.
In an audio context, an audio signal may be perceived differently by the two sensors 33.124a and 33.124b. For example, if the strength of the signal 33.101 is stronger as measured at microphone 33.124a than at microphone 33.124b, the AEFS 33.100 may infer that the signal 33.101 is originating from the driver's left of the motorcycle 33.110b, and thus that a vehicle is approaching from that direction. As another example, because the strength of an audio signal is known to decay with distance, and assuming an initial level (e.g., based on an average signal level of a vehicle engine), the AEFS 33.100 may determine a distance (or distance interval) between one or more of the microphones and the signal source.
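For illustration, under the inverse-distance law for sound pressure (a drop of roughly 6 dB per doubling of distance), an assumed reference level can be converted into such a distance estimate; the 85 dB-at-1-meter reference below is a hypothetical value, not a measured engine level.

```python
def distance_from_level(measured_db, reference_db=85.0, reference_distance_m=1.0):
    """Inverse-distance law for sound pressure:
    d = d_ref * 10 ** ((L_ref - L_measured) / 20)."""
    return reference_distance_m * 10 ** ((reference_db - measured_db) / 20.0)

# An engine measured at 65 dB would be placed roughly 10 m away
# under these assumptions.
print(round(distance_from_level(65.0), 1))   # -> 10.0
```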
The AEFS 33.100 may model vehicles and other objects, such as by representing their motion-related information, including position, speed, acceleration, mass, and other properties. Such a model may then be used to determine whether objects are likely to collide. Note that the model may be probabilistic. For example, the AEFS 33.100 may represent an object's position in space as a region that includes multiple positions, each having a corresponding likelihood that the object is at that position. As another example, the AEFS 33.100 may represent the velocity of an object as a range of likely values, a probability distribution, or the like. Various frames of reference may be employed, including a user-centric frame, an absolute frame, or the like.
The AEFS 33.100 may interact with various types of wearable devices 33.120, including a motorcycle helmet 33.120a (
In some embodiments, a wearable device may perform some or all of the functions of the AEFS 33.100, even though the AEFS 33.100 is depicted as separate in these examples. Some devices may have minimal processing power and thus perform only some of the functions. For example, the eyeglasses 33.120b may receive vehicular threat information from a remote AEFS 33.100 and display it on a heads-up display presented on the inside of the lenses of the eyeglasses 33.120b. Other wearable devices may have sufficient processing power to perform more of the functions of the AEFS 33.100. For example, the personal media device 33.120e may have considerable processing power and as such be configured to perform acoustic source localization, collision detection analysis, or other more computationally expensive functions.
Note that the wearable devices 33.120 may act in concert with one another or with other entities to perform functions of the AEFS 33.100. For example, the eyeglasses 33.120b may include a display mechanism that receives and displays vehicular threat information determined by the personal media device 33.120e. As another example, the goggles 33.120c may include a display mechanism that receives and displays vehicular threat information determined by a computing device in the helmet 33.120a or 33.120d. In a further example, one of the wearable devices 33.120 may receive and process audio data received by microphones mounted on the vehicle 33.110c.
The AEFS 33.100 may also or instead interact with vehicles 33.110 and/or computing devices installed thereon. As noted, a vehicle 33.110 may have one or more sensors or devices that may operate as (direct or indirect) sources of information for the AEFS 33.100. The vehicle 33.110c, for example, may include a speedometer, an accelerometer, one or more microphones, one or more range sensors, or the like. Data obtained by, at, or from such devices of vehicle 33.110c may be forwarded to the AEFS 33.100, possibly by a wearable device 33.120 of an operator of the vehicle 33.110c.
In some embodiments, the vehicle 33.110c may itself have or use an AEFS, and be configured to transmit warnings or other vehicular threat information to others. For example, an AEFS of the vehicle 33.110c may have determined that the moped 33.110a was driving with excessive speed just prior to the scenario depicted in
The AEFS 33.100 may also or instead interact with sensors and other devices that are installed on, in, or about roads or in other transportation related contexts, such as parking garages, racetracks, or the like. In this example, the AEFS 33.100 interacts with the camera 33.108 to obtain images of vehicles, pedestrians, or other objects present in a roadway. Other types of sensors or devices may include range sensors, infrared sensors, induction coils, radar guns, temperature gauges, precipitation gauges, or the like.
The AEFS 33.100 may further interact with information systems that are not shown in
In some embodiments, the AEFS 33.100 may transmit information to law enforcement agencies and/or related computing systems. For example, if the AEFS 33.100 determines that a vehicle is driving erratically, it may transmit that fact along with information about the vehicle (e.g., make, model, color, license plate number, location) to a police computing system.
Note that in some embodiments, at least some of the described techniques may be performed without the use of any wearable devices 33.120. For example, a vehicle 33.110 may itself include the necessary computation, input, and output devices to perform functions of the AEFS 33.100. For instance, the AEFS 33.100 may present vehicular threat information on output devices of a vehicle 33.110, such as a radio speaker, dashboard warning light, heads-up display, or the like. As another example, a computing device on a vehicle 33.110 may itself determine the vehicular threat information.
In some embodiments, the AEFS 33.100 processes the image 33.140 to perform object identification. Upon processing the image 33.140, the AEFS 33.100 may identify the moped 33.110a, the child 33.141, the sun 33.142, the puddle 33.143, and/or the roadway 33.144. A sequence of images, taken at different times (e.g., one tenth of a second apart) may be used to determine that the moped 33.110a is moving, how fast the moped 33.110a is moving, acceleration/deceleration of the moped 33.110a, or the like. Motion of other objects, such as the child 33.141 may also be tracked. Based on such motion-related information, the AEFS 33.100 may model the physics of the identified objects to determine whether a collision is likely.
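As a hedged illustration of deriving motion-related information from an image sequence, the sketch below computes finite-difference speed and acceleration estimates from positions recovered in successive frames taken one tenth of a second apart. It assumes some camera-to-ground mapping has already produced positions in meters; that mapping, and the numbers used, are hypothetical.

```python
# Finite-difference motion estimates from per-frame object positions (meters).

def speeds_from_track(positions_m: list[tuple[float, float]],
                      dt_s: float) -> list[float]:
    """Per-interval speeds (m/s) from a sequence of (x, y) positions."""
    return [((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 / dt_s
            for (x1, y1), (x2, y2) in zip(positions_m, positions_m[1:])]

track = [(0.0, 0.0), (0.9, 0.0), (1.9, 0.0), (3.0, 0.0)]  # e.g., the moped
speeds = speeds_from_track(track, dt_s=0.1)
accel_mps2 = (speeds[-1] - speeds[0]) / (0.1 * (len(speeds) - 1))
print(speeds, accel_mps2)   # speeds near 9-11 m/s and a positive acceleration
```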
Determining vehicular threat information may also or instead be based on factors related or relevant to objects other than the moped 33.110a or the user 33.104. For example, the AEFS 33.100 may determine that the puddle 33.143 will likely make it more difficult for the moped 33.110a to stop. Thus, even if the moped 33.110a is moving at a reasonable speed, its operator still may be unable to stop prior to entering the intersection due to the presence of the puddle 33.143. As another example, the AEFS 33.100 may determine that evasive action by the user 33.104 and/or the moped 33.110a may cause injury to the child 33.141. As a further example, the AEFS 33.100 may determine that it may be difficult for the user 33.104 to see the moped 33.110a and/or the child 33.141 due to the position of the sun 33.142. Such information may be incorporated into any models, predictions, or determinations made or maintained by the AEFS 33.100.
The scenario of
In this example, the AEFS 33.100 determines that the driver of the motorcycle 33.110b intends to make a left turn. This determination may be based on the fact that the motorcycle 33.110b is slowing down or has activated its turn signals. In some embodiments, when the driver activates a turn signal, an indication of the activation is transmitted to the AEFS 33.100. The AEFS 33.100 then receives information (e.g., image data) about the moped 33.110a from the camera 33.108 and possibly one or more other sources (e.g., a camera, microphone, or other device on the motorcycle 33.110b; a device on the moped 33.110a; a road-embedded device). By analyzing the image data, the AEFS 33.100 can estimate the motion-related information (e.g., position, speed, acceleration) about the moped 33.110a. Based on this motion-related information, the AEFS 33.100 can determine threat information such as whether the moped 33.110a is slowing to stop or instead attempting to speed through the intersection. The AEFS 33.100 can then inform the user of the determined threat information, as discussed further with respect to
The display 33.150 may be used by embodiments of the AEFS to present threat information to users. For example, as discussed with respect to the scenario of
The display 33.150 may be provided in various ways. In one embodiment, the display 33.150 is presented by a heads-up display provided by a vehicle, such as the motorcycle 33.110b, a car, truck, or the like, where the display is presented on the wind screen or other surface. In another embodiment, the display 33.150 may be presented by a heads-up display provided by a wearable device, such as goggles or a helmet, where the display 33.150 is presented on a face or eye shield. In another embodiment, the display 33.150 may be presented by an LCD or similar screen in a dashboard or other portion of a vehicle.
The threat analysis engine 34.210 includes an audio processor 34.212, an image processor 34.214, other sensor data processors 34.216, and an object tracker 34.218. In one example, the audio processor 34.212 processes audio data received from a wearable device 33.120. As noted, such data may be received from other sources as well or instead, including directly from a vehicle-mounted microphone, or the like. The audio processor 34.212 may perform various types of signal processing, including audio level analysis, frequency analysis, acoustic source localization, or the like. Based on such signal processing, the audio processor 34.212 may determine the strength and direction of audio signals, audio source distance, audio source type, or the like. Outputs of the audio processor 34.212 (e.g., that an object is approaching from a particular angle) may be provided to the object tracker 34.218 and/or stored in the data store 34.240.
The image processor 34.214 receives and processes image data that may be received from sources such as a wearable device 33.120 and/or information sources 33.130. For example, the image processor 34.214 may receive image data from a camera of a wearable device 33.120, and perform object recognition to determine the type and/or position of a vehicle that is approaching the user 33.104. As another example, the image processor 34.214 may receive a video signal (e.g., a sequence or stream of images) and process them to determine the type, position, and/or velocity of a vehicle that is approaching the user 33.104. Multiple images may be processed to determine the presence or absence of motion, even if no object recognition is performed. Outputs of the image processor 34.214 (e.g., position and velocity information, vehicle type information) may be provided to the object tracker 34.218 and/or stored in the data store 34.240.
The other sensor data processor 34.216 receives and processes data received from other sensors or sources. For example, the other sensor data processor 34.216 may receive and/or determine information about the position and/or movements of the user and/or one or more vehicles, such as based on GPS systems, speedometers, accelerometers, or other devices. As another example, the other sensor data processor 34.216 may receive and process conditions information (e.g., temperature, precipitation) from the information sources 33.130 and determine that road conditions are currently icy. Outputs of the other sensor data processor 34.216 (e.g., that the user is moving at 5 miles per hour) may be provided to the object tracker 34.218 and/or stored in the data store 34.240.
The object tracker 34.218 manages a geospatial object model that includes information about objects known to the AEFS 33.100. The object tracker 34.218 receives and merges information about object types, positions, velocity, acceleration, direction of travel, and the like, from one or more of the processors 34.212, 34.214, 34.216, and/or other sources. Based on such information, the object tracker 34.218 may identify the presence of objects as well as their likely positions, paths, and the like. The object tracker 34.218 may continually update this model as new information becomes available and/or as time passes (e.g., by plotting a likely current position of an object based on its last measured position and trajectory). The object tracker 34.218 may also maintain confidence levels corresponding to elements of the geospatial model, such as a likelihood that a vehicle is at a particular position or moving at a particular velocity, that a particular object is a vehicle and not a pedestrian, or the like.
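The sketch below is one hypothetical way such a tracker might be structured: tracks are dead-reckoned forward as time passes, confidence decays with staleness, and new reports from the audio, image, or other sensor processors are blended in according to a confidence weight. The class names and the blending rule are illustrative assumptions, not the object tracker 34.218 itself.

```python
# Tiny geospatial object model: dead reckoning plus confidence-weighted merge.
from dataclasses import dataclass

@dataclass
class Track:
    x: float
    y: float
    vx: float
    vy: float
    confidence: float = 0.5      # belief that this state is accurate

class ObjectTracker:
    def __init__(self) -> None:
        self.tracks: dict[str, Track] = {}

    def advance(self, dt_s: float) -> None:
        """Project every track forward along its last known trajectory."""
        for t in self.tracks.values():
            t.x += t.vx * dt_s
            t.y += t.vy * dt_s
            t.confidence *= 0.95          # staleness erodes confidence

    def report(self, obj_id: str, x: float, y: float,
               vx: float, vy: float, weight: float) -> None:
        """Merge a new measurement from one of the sensor data processors."""
        old = self.tracks.get(obj_id)
        if old is None or weight >= old.confidence:
            self.tracks[obj_id] = Track(x, y, vx, vy, confidence=weight)
            return
        w = weight / (weight + old.confidence)   # partial blend toward report
        old.x += w * (x - old.x)
        old.y += w * (y - old.y)
        old.vx += w * (vx - old.vx)
        old.vy += w * (vy - old.vy)
        old.confidence = min(1.0, old.confidence + 0.1 * weight)  # corroboration

tracker = ObjectTracker()
tracker.report("moped", x=-30.0, y=0.0, vx=9.0, vy=0.0, weight=0.8)
tracker.advance(0.5)
print(tracker.tracks["moped"])
```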
The agent logic 34.220 implements the core intelligence of the AEFS 33.100. The agent logic 34.220 may include a reasoning engine (e.g., a rules engine, decision trees, Bayesian inference engine) that combines information from multiple sources to determine vehicular threat information. For example, the agent logic 34.220 may combine information from the object tracker 34.218, such as that there is a determined likelihood of a collision at an intersection, with information from one of the information sources 33.130, such as that the intersection is the scene of common red-light violations, and decide that the likelihood of a collision is high enough to transmit a warning to the user 33.104. As another example, the agent logic 34.220 may, in the face of multiple distinct threats to the user, determine which threat is the most significant and cause the user to avoid the more significant threat, such as by not directing the user 33.104 to slam on the brakes when a bicycle is approaching from the side but a truck is approaching from the rear, because being rear-ended by the truck would have more serious consequences than being hit from the side by the bicycle.
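By way of a simplified, hypothetical illustration of this kind of rule-based combination, the fragment below boosts a collision likelihood reported by an object tracker when an external source flags the intersection as accident-prone, and warns only above a threshold. The boost factor and threshold are invented for the example and are not part of the described agent logic.

```python
# One toy rule a reasoning engine might apply before issuing a warning.

def should_warn(collision_likelihood: float,
                intersection_flagged_high_risk: bool,
                warn_threshold: float = 0.2) -> bool:
    if intersection_flagged_high_risk:
        # Corroborating context (e.g., frequent red-light violations) raises
        # the effective likelihood used for the warning decision.
        collision_likelihood = min(1.0, collision_likelihood * 1.5)
    return collision_likelihood >= warn_threshold

print(should_warn(0.15, intersection_flagged_high_risk=True))    # True
print(should_warn(0.15, intersection_flagged_high_risk=False))   # False
```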
The presentation engine 34.230 includes a visible output processor 34.232 and an audible output processor 34.234. The visible output processor 34.232 may prepare, format, and/or cause information to be displayed on a display device, such as a display of the presentation device 34.250 (e.g., a heads-up display of a vehicle 33.110 being driven by the user 33.104), a wearable device 33.120, or some other display. The agent logic 34.220 may use or invoke the visible output processor 34.232 to prepare and display information, such as by formatting or otherwise modifying vehicular threat information to fit on a particular type or size of display. The audible output processor 34.234 may include or use other components for generating audible output, such as tones, sounds, voices, or the like. In some embodiments, the agent logic 34.220 may use or invoke the audible output processor 34.234 in order to convert a textual message (e.g., a warning message, a threat identification) into audio output suitable for presentation via the presentation device 34.250, for example by employing a text-to-speech processor.
Note that one or more of the illustrated components/modules may not be present in some embodiments. For example, in embodiments that do not perform image or video processing, the AEFS 33.100 may not include an image processor 34.214. As another example, in embodiments that do not perform audio output, the AEFS 33.100 may not include an audible output processor 34.234.
Note also that the AEFS 33.100 may act in service of multiple users 33.104. In some embodiments, the AEFS 33.100 may determine vehicular threat information concurrently for multiple distinct users. Such embodiments may further facilitate the sharing of vehicular threat information. For example, vehicular threat information determined as between two vehicles may be relevant and thus shared with a third vehicle that is in proximity to the other two vehicles.
FIGS. 35.1-35.93 are example flow diagrams of ability enhancement processes performed by example embodiments.
At block 35.101, the process performs at a first vehicle, receiving threat information from a remote device, the threat information based at least in part on information about objects and/or conditions proximate to the remote device. In some embodiments, threat information determined by other devices or systems is received at the first vehicle. For example, the first vehicle may receive threat information from a road-based device that has determined that some other vehicle is driving erratically. As another example, the first vehicle may receive threat information from some other vehicle that has detected icy conditions on the roadway. The remote device may be any fixed device (e.g., at or about the roadway) or mobile device (e.g., located in another vehicle, on another person) that is capable of providing threat information to the process.
At block 35.102, the process performs determining that the threat information is relevant to safe operation of the first vehicle. The process may determine that the received threat information is relevant in various ways. For example, the process may determine whether the first vehicle is heading towards a location associated with the threat information (e.g., an upcoming intersection), and if so, present the threat information to the driver of the first vehicle.
At block 35.103, the process performs modifying operation of the first vehicle based on the threat information. Modifying the operation of the first vehicle may include presenting a message based on the threat information to the driver or other occupant of the first vehicle. Modifying the operation may also or instead include modifying controls (e.g., accelerator, brakes, steering wheel, lights) of the first vehicle.
At block 35.201, the process performs receiving threat information determined based on information about driving conditions proximate to the remote device. Information about driving conditions may include or be based on weather information (e.g., snow, rain, ice, temperature), time information (e.g., night or day), lighting information (e.g., a light sensor indicating glare from the setting sun), or the like.
At block 35.801, the process performs receiving threat information determined based on information about a second vehicle proximate to the remote device. The information about the second vehicle may be or indicate unusual or out of the ordinary conditions with respect to the second vehicle, such as that the second vehicle is driving erratically, with excessive speed, or the like. The information about the second vehicle may be motion-related information (e.g., velocity, trajectory) or higher-order information, such as a determination that the second vehicle is driving erratically.
At block 35.1201, the process performs receiving threat information determined based on information about a pedestrian proximate to the remote device. The information about the pedestrian may be or be based on an image of the pedestrian, an audio signal received from the pedestrian, an infrared heat signal of the pedestrian, location information received from a mobile device of the pedestrian, or the like.
At block 35.1401, the process performs receiving threat information determined based on information about an object in a roadway proximate to the remote device. Other objects, including animals, refuse, tree limbs, parked vehicles, or the like may be considered.
At block 35.1501, the process performs receiving threat information determined at a second vehicle with respect to information about objects and/or conditions received at the second vehicle. In some embodiments, the threat information is determined by a second vehicle, such as by an AEFS or similar system that is present in the second vehicle (e.g., on a mobile device of an occupant or installed in the vehicle). In this manner, efforts made by other systems to determine threat information may be shared with this process, as well as possibly other systems, devices, or processes.
At block 35.1601, the process performs receiving threat information determined by a wearable device of an occupant of the second vehicle. In some embodiments, the occupant of the second vehicle has a wearable device that executes an AEFS or similar system to determine the threat information. This threat information is then transmitted to, and received by, the process.
At block 35.1701, the process performs receiving threat information determined by a computing device installed in the second vehicle. In some embodiments, the second vehicle includes a computing device that executes an AEFS or similar system to determine the threat information. This threat information is then transmitted to, and received by, the process.
At block 35.1801, the process performs receiving motion-related information from a sensor attached to the second vehicle. The motion-related information may include information about the mechanics (e.g., position, velocity, acceleration, mass) of the second vehicle. Various types of sensors are contemplated, including speedometers, GPS receivers, accelerometers, and the like.
At block 35.1901, the process performs receiving position information from a position sensor of the second vehicle. In some embodiments, a GPS receiver, dead reckoning, or some combination thereof may be used to track the position of the second vehicle as it moves down the roadway.
At block 35.2001, the process performs receiving velocity information from a velocity sensor of the second vehicle. In some embodiments, a GPS receiver, a speedometer or other device is employed to determine the velocity of the second vehicle.
At block 35.2101, the process performs receiving threat information determined by a road-based device with respect to information about objects and/or conditions received at the road-based device. In some embodiments, the threat information is determined by a road-based device, such as a sensor or computing device. For example, a computing device located at an intersection may determine threat information about vehicles and other objects entering the intersection. This threat information may be shared with vehicles in the vicinity of the intersection, including the first vehicle. In this manner, efforts made by other systems to determine threat information may be shared with this process, as well as possibly other systems, devices, or processes.
At block 35.2201, the process performs receiving threat information determined by a road-based computing device configured to receive the information about objects and/or conditions from vehicles proximate to the road-based computing device. The road-based device may be a computing device that executes an AEFS or similar system and that shares determined threat information with the process, as well as with other systems in the vicinity of the road-based device. The road-based device may receive information from vehicles, such as motion-related information that can be employed to track and/or predict the motion of those vehicles. The road-based device may be placed at or about locations that are frequent accident sites, such as intersections, blind corners, or the like.
At block 35.2301, the process performs receiving threat information determined by a road-based computing device configured to receive the information about objects and/or conditions from road-based sensors. The road-based computing device may receive information from road-based sensors, such as items attached to structures or embedded in the roadway, including cameras, ranging devices, speed sensors, or the like.
At block 35.2601, the process performs receiving an image of a second vehicle from a camera deployed at an intersection. For example, the process may receive images of a second vehicle from a camera that is fixed to a traffic light or other signal at an intersection near the first vehicle.
At block 35.2701, the process performs receiving ranging data from a range sensor deployed at an intersection, the ranging data representing a distance between a second vehicle and the intersection. For example, the process may receive a distance (e.g., 75 meters) measured between some known point in the intersection (e.g., the position of the range sensor) and an oncoming vehicle.
At block 35.3501, the process performs receiving motion-related information from the induction loop, the motion-related information including at least one of a position of the second vehicle, a velocity of the second vehicle, and/or a trajectory of the second vehicle. As noted, induction loops may be embedded in the roadway and configured to detect the presence of vehicles passing over them. Some types of loops and/or processing may be employed to detect other information, including velocity, vehicle size, and the like. Multiple induction loops may be configured to work in concert to measure, for example, vehicle velocity.
At block 35.3601, the process performs determining a location associated with the remote device. In some embodiments, determining the threat information may be based on the relative locations of the first vehicle and the remote device. In general, the closer the first vehicle is to the remote device, the more likely that the threat information provided by that remote device is relevant to the first vehicle. For example, threat information provided by a device being approached by the first vehicle is likely to be more relevant than threat information provided by a device that is behind the first vehicle. The location may be expressed as a point or a region (e.g., a polygon, circle).
At block 35.3602, the process performs determining whether the first vehicle is approaching the location. The process may use information about the current position and direction of travel of the first vehicle (e.g., provided by a GPS receiver) and the location of the remote device.
At block 35.3901, the process performs determining whether the first vehicle is within a threshold distance from the location associated with the remote device. The process may determine whether the distance between the first vehicle and the device location is less than a threshold number. The threshold number may be a fixed number (e.g., 10, 20, 50, 100 meters) or be based on various factors including vehicle speeds, surface conditions, driver skill level, and the like. For example, a higher threshold number may be used if the surface conditions are icy and thus require a greater stopping distance.
At block 35.4001, the process performs determining whether the first vehicle is moving on a trajectory that intersects the location associated with the remote device. The process may determine that the threat information is relevant if the first vehicle is moving on a trajectory that intersects (or nearly intersects) the location. In this manner, threat information about locations that are behind or to the side of the vehicle may be ignored or filtered out.
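The relevance checks described in blocks 35.3601 through 35.4001 might be combined along the lines of the following sketch, which treats the threat information as relevant if the first vehicle is within a threshold distance of the remote device's location or is heading roughly toward it. The flat x/y geometry, the 50 meter threshold, and the 15 degree bearing tolerance are assumptions for illustration only.

```python
# Combined relevance check: close to the location, or on a course toward it.
import math

def threat_is_relevant(vehicle_xy: tuple[float, float],
                       vehicle_heading_xy: tuple[float, float],
                       device_xy: tuple[float, float],
                       threshold_m: float = 50.0,
                       bearing_tolerance_deg: float = 15.0) -> bool:
    dx = device_xy[0] - vehicle_xy[0]
    dy = device_xy[1] - vehicle_xy[1]
    if math.hypot(dx, dy) <= threshold_m:
        return True                                   # already close by
    bearing_to_device = math.atan2(dy, dx)
    heading = math.atan2(vehicle_heading_xy[1], vehicle_heading_xy[0])
    diff = abs((bearing_to_device - heading + math.pi) % (2 * math.pi) - math.pi)
    return diff <= math.radians(bearing_tolerance_deg)  # roughly approaching

print(threat_is_relevant((0, 0), (1, 0), (120, 5)))   # ahead, nearly in line
print(threat_is_relevant((0, 0), (1, 0), (-80, 0)))   # behind the vehicle
```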
At block 35.4101, the process performs determining a threat to the first vehicle based on the threat information. The process may determine that the threat represented by the threat information is also a threat to the first vehicle. For example, if the threat information identifies an erratic vehicle, that erratic vehicle may also pose a threat to the first vehicle. Alternatively, the process may determine a distinct threat to the first vehicle based on a threat represented by the threat information. For example, the threat information may indicate that the setting sun is causing a visibility problem for a second vehicle that happens to be approaching the first vehicle. From this, the process may infer that the second vehicle poses a threat (because the driver cannot see) to the first vehicle, even though the setting sun does not in and of itself pose a direct problem or threat for the first vehicle.
At block 35.4102, the process performs determining a likelihood associated with the threat. In some embodiments, probabilities may be associated with threats, based on various factors, such as levels of uncertainty associated with measurements or other data used by the process, aggregate risk levels (e.g., number of accidents per year in a given intersection), or the like.
At block 35.4103, the process performs determining that the likelihood is greater than a threshold level. The process may determine that threat information is relevant when the likelihood is above a particular threshold. The threshold may be fixed (e.g., 10%, 20%) or based on various factors including vehicle speeds, surface conditions, driver skill level, and the like.
At block 35.4201, the process performs predicting a path of an object identified by the threat information. The process may model the path of the object by using motion-related information obtained about or provided by the object. The path may include a vector that represents velocity and direction of travel. The path may also represent at-rest, non-moving objects.
At block 35.4202, the process performs predicting a path of the first vehicle. Similarly, the process may model the path of the first vehicle based on motion-related information.
At block 35.4203, the process performs determining, based on the paths of the object and the first vehicle, whether the first vehicle and the object will come within a threshold distance of one another. A threshold distance may be used to detect situations in which even though there is no collision, the first vehicle and the object pass uncomfortably close to one another (e.g., a “near miss”). Different thresholds are contemplated, including 0, 10 cm, 25 cm, and 1 m.
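A minimal constant-velocity sketch of the path check in blocks 35.4201 through 35.4203 follows; it steps both predicted paths forward and flags a threat if the object and the first vehicle ever come within the chosen threshold distance. The positions, velocities, horizon, and step size are hypothetical.

```python
# Near-miss detection under constant-velocity path predictions.
import math

def will_pass_within(vehicle_xy, vehicle_v, object_xy, object_v,
                     threshold_m: float = 1.0, horizon_s: float = 10.0,
                     dt_s: float = 0.1) -> bool:
    t = 0.0
    while t <= horizon_s:
        v = (vehicle_xy[0] + vehicle_v[0] * t, vehicle_xy[1] + vehicle_v[1] * t)
        o = (object_xy[0] + object_v[0] * t, object_xy[1] + object_v[1] * t)
        if math.dist(v, o) <= threshold_m:
            return True
        t += dt_s
    return False

# First vehicle heading north at 10 m/s; an object approaches from its left
# at 9 m/s on a crossing path. The paths nearly meet about five seconds out.
print(will_pass_within((0, 0), (0, 10), (-45, 50), (9, 0)))   # True
```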
At block 35.4301, the process performs determining a likelihood that the first vehicle will collide with a second vehicle identified by the threat information. In some cases, the object identified by the threat information may be a second vehicle, and the process may determine a likelihood of collision, based on current positions and trajectories of the two vehicles, uncertainty about the data used to determine the trajectories, and/or other factors.
At block 35.4401, the process performs determining a likelihood that the first vehicle will collide with a pedestrian identified by the threat information. In some cases, the object identified by the threat information may be a pedestrian, and the process may determine a likelihood of collision between the first vehicle and the pedestrian, based on current positions and trajectories, uncertainty about the data used to determine the trajectories, and/or other factors.
At block 35.4501, the process performs determining a likelihood that the first vehicle will collide with an animal identified by the threat information.
At block 35.4601, the process performs determining a likelihood that surface conditions identified by the threat information will cause an operator to lose control of the first vehicle. In some cases, the threat information will identify hazardous surface conditions, such as ice. The process may determine a likelihood that the operator of the first vehicle will not be able to control the vehicle in the presence of such surface conditions. Such a likelihood may be based on various factors, such as whether the vehicle is presently turning (and thus more likely to spin out in the presence of ice), whether the vehicle is braking, or the like. The likelihood may also be based on the specific type of surface condition, with icy conditions resulting in higher likelihoods than wet conditions, for example.
At block 35.4701, the process performs determining that the threat information is relevant based on gaze information associated with an operator of the first vehicle. In some embodiments, the process may consider the direction in which the vehicle operator is looking when determining that the threat information is relevant. For example, the relevance determination may depend on whether the operator is or is not looking in the direction of a threat (e.g., another vehicle) identified by the threat information, as discussed further below.
At block 35.4801, the process performs determining that the operator has not looked at the road for more than a threshold amount of time. In some cases, the process may consider whether the operator has taken his eyes off the road, such as to adjust the car radio, attend to a mobile phone, or the like.
At block 35.4901, the process performs receiving an indication of a direction in which the operator is looking. In some embodiments, an orientation sensor such as a gyroscope or accelerometer may be employed to determine the orientation of the operator's head, face, eyes, or other body part. In some embodiments, a camera or other image sensing device may track the orientation of the operator's eyes.
At block 35.4902, the process performs determining that the operator is not looking towards an approaching second vehicle. As noted, received threat information (or a tracking system employed by the process) may indicate the position of a second vehicle. Given this information, coupled with information about the direction of the operator's gaze, the process may determine whether or not the operator is (or likely is) looking in the direction of the second vehicle.
At block 35.4903, the process performs in response to determining that the operator is not looking towards the second vehicle, directing the operator to look towards the second vehicle. When it is determined that the operator is not looking at the second vehicle, the process may warn or otherwise direct the operator to look in that direction, such as by saying or otherwise presenting “Look right!”, “Car on your left,” or similar message.
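One hypothetical way to implement the gaze comparison in blocks 35.4901 through 35.4903 is sketched below: the operator's gaze bearing (for example, from a helmet orientation sensor) is compared against the bearing toward the approaching vehicle, and a prompt is produced when the two differ by more than a tolerance. The bearing convention (degrees clockwise, positive to the operator's right) and the 30 degree tolerance are assumptions.

```python
# Compare gaze bearing with the bearing to the threat and prompt if needed.

def gaze_prompt(gaze_bearing_deg: float, threat_bearing_deg: float,
                tolerance_deg: float = 30.0):
    """Return a prompt such as "Look right!", or None if none is needed."""
    diff = (threat_bearing_deg - gaze_bearing_deg + 180.0) % 360.0 - 180.0
    if abs(diff) <= tolerance_deg:
        return None                       # operator already looking that way
    return "Look right!" if diff > 0 else "Look left!"

print(gaze_prompt(gaze_bearing_deg=0.0, threat_bearing_deg=75.0))   # Look right!
print(gaze_prompt(gaze_bearing_deg=70.0, threat_bearing_deg=75.0))  # None
```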
At block 35.5001, the process performs presenting a message based on the threat information to an operator of the first vehicle. The process may present (e.g., display an image, play audio) a message, such as a warning or instruction that is based on the threat information. For example, if the threat information identifies icy surface conditions, the message may instruct or recommend that the operator slow down.
At block 35.5101, the process performs presenting a warning to the operator.
At block 35.5201, the process performs presenting an instruction to the operator.
At block 35.5301, the process performs directing the operator to accelerate or decelerate.
At block 35.5401, the process performs directing the operator to turn or not to turn. In some embodiments, the process may provide “turn assistance,” by helping drivers better understand when it is appropriate to make a turn across one or more lanes of oncoming traffic. In such an embodiment, the process tracks vehicles as they approach an intersection to determine whether a vehicle waiting to turn across oncoming lanes of traffic has sufficient time, distance, or clearance to cross the lanes without colliding with the approaching vehicles.
At block 35.5501, the process performs presenting the message via an audio output device. The process may play an alarm, bell, chime, voice message, or the like that warns or otherwise informs the user of the threat information. The wearable device may include audio speakers operable to output audio signals, including as part of a set of earphones, earbuds, a headset, a helmet, or the like.
At block 35.5901, the process performs presenting the message via a visual display device. In some embodiments, the wearable device includes a display screen or other mechanism for presenting visual information. For example, when the wearable device is a helmet, a face shield of the helmet may be used as a type of heads-up display for presenting the threat information.
At block 35.6201, the process performs displaying an indicator that instructs the operator to look towards an oncoming vehicle identified by the threat information. The displayed indicator may be textual (e.g., “Look right!”), iconic (e.g., an arrow), or the like.
At block 35.6301, the process performs displaying an indicator that instructs the operator to accelerate, decelerate, and/or turn. An example indicator may be or include the text “Speed up,” “slow down,” “turn left,” or similar language.
At block 35.6401, the process performs providing tactile feedback to the user. Tactile feedback may include temperature or positional changes of an object (e.g., steering wheel, seat, pedal) within the first vehicle.
At block 35.6501, the process performs causing a steering device, seat, and/or pedal of the first vehicle to vibrate.
At block 35.6601, the process performs controlling the first vehicle. In some embodiments, the process may directly modify the operation of the first vehicle by controlling it in some manner, such as by changing the steering, braking, accelerating, or the like.
At block 35.6701, the process performs decreasing speed of the first vehicle by applying brakes of the first vehicle and/or by reducing output of an engine of the first vehicle. The process may slow the vehicle by one or more of braking or reducing engine output.
At block 35.6801, the process performs increasing speed of the first vehicle by releasing brakes of the first vehicle and/or by increasing output of an engine of the first vehicle. The process may speed up the vehicle by one or more of releasing the brakes or increasing engine output.
At block 35.6901, the process performs changing direction of the first vehicle. The process may change direction of the vehicle by modifying the angle of the wheels of the vehicle.
At block 35.7001, the process performs receiving threat information determined based on image data. The process may receive threat information that is based on image data. Image data may be used for performing image processing to identify vehicles or other hazards, to determine whether collisions may occur, determine motion-related information about the first vehicle (and possibly other entities), and the like. The image data may be obtained from various sources, including from a camera attached to a wearable device, a vehicle, a road-side structure, or the like.
At block 35.7101, the process performs receiving threat information determined based on image data received from a camera that is attached to at least one of: a road-side structure, a wearable device of a pedestrian, or a second vehicle.
At block 35.7201, the process performs receiving threat information determined based on image data that includes multiple images of a second vehicle taken at different times. In some embodiments, the image data comprises video data in compressed or raw form. The video data typically includes (or can be reconstructed or decompressed to derive) multiple sequential images taken at distinct times. Various time intervals between images may be utilized. For example, it may not be necessary to receive video data having a high frame rate (e.g., 30 frames per second or higher), because it may be preferable to determine motion or other properties of the first vehicle based on images that are taken at larger time intervals (e.g., one tenth of a second, one quarter of a second). In some embodiments, transmission bandwidth may be saved by transmitting and receiving reduced frame rate image streams.
At block 35.7301, the process performs receiving threat information that includes motion-related information about a second vehicle based on one or more images of the second vehicle, the motion-related information including at least one of a position, velocity, acceleration, and/or mass of the second vehicle. Motion-related information may include information about the mechanics (e.g., kinematics, dynamics) of the second vehicle, including position, velocity, direction of travel, acceleration, mass, or the like. Motion-related information may be determined for vehicles that are at rest. Motion-related information may be determined and expressed with respect to various frames of reference, including the frame of reference of the first/second vehicle, a fixed frame of reference, a global frame of reference, or the like. For example, the position of the second vehicle may be expressed absolutely, such as via a GPS coordinate or similar representation, or relatively, such as with respect to the position of the user (e.g., 20 meters away from the first user). In addition, the position of the second vehicle may be represented as a point or collection of points (e.g., a region, arc, or line). As another example, the velocity of the second vehicle may be expressed in absolute or relative terms (e.g., with respect to the velocity of the first vehicle). The velocity may be expressed or represented as a magnitude (e.g., 10 meters per second), a vector (e.g., having a magnitude and a direction), or the like. In other embodiments, velocity may be expressed with respect to the first vehicle's frame of reference. In such cases, a stationary (e.g., parked) vehicle will appear to be approaching the user if the first vehicle is driving towards the second vehicle. In some embodiments, acceleration of the second vehicle may be determined, for example by determining a rate of change of the velocity of the second vehicle observed over time. Mass of the second vehicle may be determined in various ways, including by identifying the type of the second vehicle (e.g., car, truck, motorcycle), determining the size of the second vehicle based on its appearance in an image, or the like. In some embodiments, the images may include timestamps or other indicators that can be used to determine a time interval between the images. In other cases, the time interval may be known a priori or expressed in other ways, such as in terms of a frame rate associated with an image or video stream.
At block 35.7401, the process performs receiving threat information that identifies objects other than vehicles in the image data. Image processing techniques may be employed to identify other objects of interest, including road hazards (e.g., utility poles, ditches, drop-offs), pedestrians, other vehicles, or the like.
At block 35.7501, the process performs receiving threat information that includes driving conditions information based on the image data. Image processing techniques may be employed to determine driving conditions, such as surface conditions (e.g., icy, wet), lighting conditions (e.g., glare, darkness), or the like.
At block 35.7601, the process performs receiving threat information determined based on audio data representing an audio signal emitted or reflected by an object. The data representing the audio signal may be raw audio samples, compressed audio data, frequency coefficients, or the like. The data representing the audio signal may represent the sound made by the object, such as from a vehicle engine, a horn, tires, or any other source of sound. The data may also or instead represent audio reflected by an object, such as a sonar ping. The object may be a vehicle, a pedestrian, an animal, a fixed structure, or the like.
At block 35.7701, the process performs receiving threat information determined based on audio data obtained at a microphone array that includes multiple microphones. In some embodiments, a microphone array having two or more microphones is employed to receive audio signals. Differences between the received audio signals may be utilized to perform acoustic source localization or other functions, as discussed further herein.
At block 35.7801, the process performs receiving threat information determined based on audio data obtained at a microphone array, the microphone array coupled to a road-side structure. The array may be fixed to a utility pole, a traffic signal, or the like. In other cases, the microphone array may be situated elsewhere, including on the first vehicle, some other vehicle, a wearable device of a person, or the like.
At block 35.7901, the process performs receiving threat information determined based on acoustic source localization performed to determine a position of the object based on multiple audio signals received via multiple microphones. The position of the object may be determined by analyzing audio signals received via multiple distinct microphones. For example, engine noise of the second vehicle may have different characteristics (e.g., in volume, in time of arrival, in frequency) as received by different microphones. Differences between the audio signals measured at different microphones may be exploited to determine one or more positions (e.g., points, arcs, lines, regions) at which the object may be located. In one approach, at least two microphones are employed. By measuring differences in the arrival time of an audio signal at the two microphones, the position of the object may be determined. The determined position may be a point, a line, an area, or the like. In some embodiments, given information about the distance between the two microphones and the speed of sound, respective distances between each of the two microphones and the object may be determined. Given these two distances (along with the distance between the microphones), the process can solve for the one or more positions at which the second vehicle may be located. In some embodiments, the microphones may be directional, in that they may be used to determine the direction from which the sound is coming. Given such information, triangulation techniques may be employed to determine the position of the object.
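For the two-microphone case, the far-field (plane-wave) approximation gives a simple bearing estimate: the sine of the angle of arrival equals the speed of sound times the arrival-time difference divided by the microphone spacing. The sketch below illustrates only that relationship; the spacing and time difference are hypothetical values.

```python
# Far-field bearing from time-difference-of-arrival at two microphones.
import math

SPEED_OF_SOUND_MPS = 343.0

def bearing_from_tdoa(delta_t_s: float, mic_spacing_m: float) -> float:
    """Angle of arrival in degrees, measured from the array's broadside;
    positive toward the microphone at which the signal arrived first."""
    sin_theta = SPEED_OF_SOUND_MPS * delta_t_s / mic_spacing_m
    sin_theta = max(-1.0, min(1.0, sin_theta))   # guard against noisy inputs
    return math.degrees(math.asin(sin_theta))

# Engine noise arrives 0.5 ms earlier at one microphone of a 0.4 m pair:
print(bearing_from_tdoa(delta_t_s=0.0005, mic_spacing_m=0.4))   # ~25 degrees
```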
At block 35.8001, the process performs identifying multiple threats to the first vehicle, at least one of which is based on the threat information. The process may in some cases identify multiple potential threats, such as one car approaching the first vehicle from behind and another car approaching the first vehicle from the left.
At block 35.8002, the process performs identifying a first one of the multiple threats that is more significant than at least one other of the multiple threats. The process may rank, order, or otherwise evaluate the relative significance or risk presented by each of the identified threats. For example, the process may determine that a truck approaching from the right is a bigger risk than a bicycle approaching from behind. On the other hand, if the truck is moving very slowly (thus leaving more time for the truck and/or the first vehicle to avoid it) compared to the bicycle, the process may instead determine that the bicycle is the bigger risk.
At block 35.8003, the process performs instructing an operator of the first vehicle to avoid the first one of the multiple threats. Instructing the operator may include outputting a command or suggestion to take (or not take) a particular course of action.
At block 35.8101, the process performs selecting the most significant threat from the multiple threats.
At block 35.8201, the process performs modeling multiple potential accidents that each correspond to one of the multiple threats to determine a collision force associated with each accident. In some embodiments, the process models the physics of various objects to determine potential collisions and possibly their severity and/or likelihood. For example, the process may determine an expected force of a collision based on factors such as object mass, velocity, acceleration, deceleration, or the like.
At block 35.8202, the process performs selecting the first threat based at least in part on which of the multiple accidents has the highest collision force. In some embodiments, the process considers the threat having the highest associated collision force when determining the most significant threat, because that threat will likely result in the greatest injury to the first vehicle and/or its occupants.
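As a rough illustration only, the ranking in blocks 35.8201 and 35.8202 might use relative kinetic energy of the closing motion as a stand-in for collision force, as sketched below; the masses, speeds, and the choice of kinetic energy as the severity proxy are assumptions.

```python
# Rank threats by a crude collision-severity proxy (closing kinetic energy).

def collision_severity_j(mass_kg: float, closing_speed_mps: float) -> float:
    return 0.5 * mass_kg * closing_speed_mps ** 2

threats = {
    "truck approaching from the rear": collision_severity_j(9000.0, 8.0),
    "bicycle approaching from the side": collision_severity_j(90.0, 6.0),
}
most_significant = max(threats, key=threats.get)
print(most_significant)   # the truck, by roughly two orders of magnitude
```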
At block 35.8301, the process performs determining a likelihood of an accident associated with each of the multiple threats. In some embodiments, the process associates a likelihood (probability) with each of the multiple threats. Such a probability may be determined with respect to a physical model that represents uncertainty with respect to the mechanics of the various objects that it models.
At block 35.8302, the process performs selecting the first threat based at least in part on which of the multiple threats has the highest associated likelihood. The process may consider the threat having the highest associated likelihood when determining the most significant threat.
At block 35.8401, the process performs determining a mass of an object associated with each of the multiple threats. In some embodiments, the process may consider the mass of threat objects, based on the assumption that those objects having higher mass (e.g., a truck) pose greater threats than those having a low mass (e.g., a pedestrian).
At block 35.8402, the process performs selecting the first threat based at least in part on which of the objects has the highest mass, without reference to velocity or acceleration of the object. Mass may thus be used as a proxy for collision force, particularly when it is difficult to determine other information (e.g., velocity) about objects.
At block 35.8501, the process performs determining that an evasive action with respect to the threat information poses a threat to some other object. The process may consider whether potential evasive actions pose threats to other objects. For example, the process may analyze whether directing the operator of the first vehicle to turn right (to avoid a collision with a second vehicle) would cause the first vehicle to instead collide with a pedestrian or some fixed object, which may actually result in a worse outcome (e.g., for the operator and/or the pedestrian) than colliding with the second vehicle.
At block 35.8502, the process performs instructing an operator of the first vehicle to take some other evasive action that poses a lesser threat to that other object. The process may rank or otherwise order evasive actions (e.g., slow down, turn left, turn right) based at least in part on the risks or threats those evasive actions pose to other entities.
At block 35.8601, the process performs identifying multiple threats that each have an associated likelihood and cost, at least one of which is based on the threat information. In some embodiments, the process may perform a cost-minimization analysis, in which it considers multiple threats, including threats posed to the vehicle operator and to others, and selects a threat that minimizes or reduces expected costs. The process may also consider threats posed by actions taken by the vehicle operator to avoid other threats.
At block 35.8602, the process performs determining a course of action that minimizes an expected cost with respect to the multiple threats. The expected cost of a threat may be expressed as a product of the likelihood of damage associated with the threat and the cost associated with such damage.
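A toy version of this expected-cost selection is sketched below: each candidate course of action carries, for each threat it leaves in play, a likelihood of damage and a cost of that damage, and the action with the smallest sum of likelihood times cost is chosen. Every number in the example is invented for illustration.

```python
# Choose the course of action with the lowest expected cost.

def expected_cost(outcomes: list[tuple[float, float]]) -> float:
    """Sum of likelihood * cost over the (likelihood, cost) pairs."""
    return sum(p * c for p, c in outcomes)

actions = {
    "brake hard":   [(0.30, 20_000.0)],                    # rear-ended by truck
    "swerve right": [(0.05, 500_000.0), (0.10, 5_000.0)],  # pedestrian, curb
    "maintain":     [(0.60, 50_000.0)],                    # side collision
}
best = min(actions, key=lambda a: expected_cost(actions[a]))
print(best, expected_cost(actions[best]))   # "brake hard" at 6000.0
```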
At block 35.8801, the process performs identifying multiple threats that are each related to different persons or things. In some embodiments, the process considers risks related to multiple distinct entities, possibly including the operator of the first vehicle.
At block 35.8901, the process performs identifying multiple threats that are each related to the first vehicle and/or an operator thereof. In some embodiments, the process also or only considers risks that are related to the operator of the first vehicle and/or the first vehicle itself.
At block 35.9001, the process performs minimizing expected costs to the operator of the first vehicle posed by the multiple threats. In some embodiments, the process attempts to minimize those costs borne by the operator of the first vehicle. Note that this may in some cases cause the process to recommend a course of action that is not optimal from a societal or aggregate perspective, such as by directing the operator to take an evasive action that may cause or contribute to an accident involving other vehicles. Such an action may spare the first vehicle and its operator, but cause a greater injury to other parties.
At block 35.9101, the process performs minimizing overall expected costs posed by the multiple threats, the overall expected costs being a sum of expected costs borne by an operator of the first vehicle and other persons/things. In some embodiments, the process attempts to minimize social costs, that is, the costs borne by the various parties to an accident. Note that this may cause the process to recommend a course of action that may have a high cost to the user (e.g., crashing into a wall and damaging the user's car) to spare an even higher cost to another person (e.g., killing a pedestrian).
At block 35.9201, the process performs performing the receiving threat information, the determining that the threat information is relevant to safe operation of the first vehicle, and/or the modifying operation of the first vehicle at a computing system of the first vehicle.
At block 35.9301, the process performs performing the determining that the threat information is relevant to safe operation of the first vehicle and/or the modifying operation of the first vehicle at a computing system remote from the first vehicle.
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the AEFS 33.100. In addition, the computing system 36.400 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the AEFS 33.100 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 36.400 comprises a computer memory (“memory”) 36.401, a display 36.402, one or more Central Processing Units (“CPU”) 36.403, Input/Output devices 36.404 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 36.405, and network connections 36.406. The AEFS 33.100 is shown residing in memory 36.401. In other embodiments, some portion of the contents and some or all of the components of the AEFS 33.100 may be stored on and/or transmitted over the other computer-readable media 36.405. The components of the AEFS 33.100 preferably execute on one or more CPUs 36.403 and implement techniques described herein. Other code or programs 36.430 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 36.420, also reside in the memory 36.401, and preferably execute on one or more CPUs 36.403. Of note, one or more of the components in
The AEFS 33.100 interacts via the network 36.450 with wearable devices 33.120, information sources 33.130, and third-party systems/applications 36.455. The AEFS 33.100 may also generally interact with other output devices, such as the presentation device 34.250 described with respect to
The AEFS 33.100 is shown executing in the memory 36.401 of the computing system 36.400. Also included in the memory are a user interface manager 36.415 and an application program interface (“API”) 36.416. The user interface manager 36.415 and the API 36.416 are drawn in dashed lines to indicate that in other embodiments, functions performed by one or more of these components may be performed externally to the AEFS 33.100.
The UI manager 36.415 provides a view and a controller that facilitate user interaction with the AEFS 33.100 and its various components. For example, the UI manager 36.415 may provide interactive access to the AEFS 33.100, such that users can configure the operation of the AEFS 33.100, such as by providing the AEFS 33.100 with information about common routes traveled, vehicle types used, driving patterns, or the like. The UI manager 36.415 may also manage and/or implement various output abstractions, such that the AEFS 33.100 can cause vehicular threat information to be displayed on different media, devices, or systems. In some embodiments, access to the functionality of the UI manager 36.415 may be provided via a Web server, possibly executing as one of the other programs 36.430. In such embodiments, a user operating a Web browser executing on one of the third-party systems 36.455 can interact with the AEFS 33.100 via the UI manager 36.415.
The API 36.416 provides programmatic access to one or more functions of the AEFS 33.100. For example, the API 36.416 may provide a programmatic interface to one or more functions of the AEFS 33.100 that may be invoked by one of the other programs 36.430 or some other module. In this manner, the API 36.416 facilitates the development of third-party software, such as user interfaces, plug-ins, adapters (e.g., for integrating functions of the AEFS 33.100 into vehicle-based client systems or devices), and the like.
In addition, the API 36.416 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as code executing on one of the wearable devices 33.120, information sources 33.130, and/or one of the third-party systems/applications 36.455, to access various functions of the AEFS 33.100. For example, an information source 33.130 such as a radar gun installed at an intersection may push motion-related information (e.g., velocity) about vehicles to the AEFS 33.100 via the API 36.416. As another example, a weather information system may push current conditions information (e.g., temperature, precipitation) to the AEFS 33.100 via the API 36.416. The API 36.416 may also be configured to provide management widgets (e.g., code modules) that can be integrated into the third-party applications 36.455 and that are configured to interact with the AEFS 33.100 to make at least some of the described functionality available within the context of other applications (e.g., mobile apps).
In an example embodiment, components/modules of the AEFS 33.100 are implemented using standard programming techniques. For example, the AEFS 33.100 may be implemented as a “native” executable running on the CPU 36.403, along with one or more static or dynamic libraries. In other embodiments, the AEFS 33.100 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 36.430. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
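By way of a non-limiting example, the sketch below shows two components executing concurrently and communicating by message passing within a single process; queue-based messaging stands in here for any of the structuring techniques mentioned above (client-server, peer-to-peer, and the like).

```python
# Minimal sketch, assuming an in-process decomposition, of components that
# execute concurrently and asynchronously and communicate by message passing
# rather than by direct calls.
import queue
import threading

messages: "queue.Queue[dict]" = queue.Queue()


def sensor_component():
    # Producer: posts threat-related observations as messages.
    messages.put({"type": "velocity", "value": 88.0})
    messages.put({"type": "shutdown"})


def presentation_component():
    # Consumer: reacts to messages asynchronously.
    while True:
        msg = messages.get()
        if msg["type"] == "shutdown":
            break
        print("presenting threat info:", msg)


threads = [threading.Thread(target=sensor_component),
           threading.Thread(target=presentation_component)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```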
In addition, programming interfaces to the data stored as part of the AEFS 33.100, such as in the data store 36.420 (or 34.240), can be made available by standard mechanisms such as through C, C++, C#, and Java APIs; through libraries for accessing files, databases, or other data repositories; through markup languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 36.420 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
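As a hedged illustration of a data store backed by a relational database and accessed through a standard database API, the following sketch uses SQLite (chosen only because it requires no external server); the table and column names are assumptions, not the schema of any described embodiment.

```python
# Illustrative sketch only: storing and retrieving threat-related records
# through a standard database API. Table and column names are assumptions.
import sqlite3

conn = sqlite3.connect("aefs_datastore.db")
conn.execute("""CREATE TABLE IF NOT EXISTS threat_events (
                    id INTEGER PRIMARY KEY,
                    source TEXT,
                    description TEXT,
                    recorded_at TEXT)""")
conn.execute("INSERT INTO threat_events (source, description, recorded_at) "
             "VALUES (?, ?, datetime('now'))",
             ("radar-17", "vehicle approaching at high speed"))
conn.commit()

for row in conn.execute("SELECT source, description FROM threat_events"):
    print(row)
conn.close()
```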
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
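For example, one of the listed techniques (XML-RPC) could expose an AEFS function to remote callers roughly as sketched below; the function name and its behavior are placeholders and are not drawn from the disclosure.

```python
# Hedged sketch of exposing one AEFS function remotely via XML-RPC, one of
# the distributed-computing techniques listed above. The function name and
# its return value are illustrative assumptions.
from xmlrpc.server import SimpleXMLRPCServer


def get_threat_summary(region: str) -> str:
    # Placeholder logic; a real deployment would consult the data store.
    return f"no active threats reported for {region}"


server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(get_threat_summary, "get_threat_summary")
# A client could then call:
#   import xmlrpc.client
#   proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
#   print(proxy.get_threat_summary("downtown"))
server.serve_forever()
```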
Furthermore, in some embodiments, some or all of the components of the AEFS 33.100 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to, one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers (including microcontrollers and/or embedded controllers) executing appropriate instructions, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the components and/or data structures may be stored on tangible, non-transitory storage mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
From the foregoing it will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of this disclosure. For example, the methods, techniques, and systems for ability enhancement are applicable to other architectures and other settings. For example, instead of providing threat information to human users who are vehicle operators or pedestrians, some embodiments may provide such information to control systems that are installed in vehicles and that are configured to automatically take action to avoid collisions in response to such information. In addition, the techniques are not limited to road-based vehicles (e.g., cars, bicycles), but are also applicable to airborne vehicles, including unmanned aerial vehicles (e.g., drones). Also, the methods, techniques, and systems discussed herein are applicable to differing protocols, communication media (optical, wireless, cable, etc.) and devices (e.g., desktop computers, wireless handsets, electronic organizers, personal digital assistants, tablet computers, portable email machines, game machines, pagers, navigation devices, etc.).
The present application is related to and claims the benefit of the earliest available effective filing date(s) from the following listed application(s) (the “Related Applications”) (e.g., claims earliest available priority dates for other than provisional patent applications or claims benefits under 35 USC §119(e) for provisional patent applications, for any and all parent, grandparent, great-grandparent, etc. applications of the Related Application(s)). All subject matter of the Related Applications and of any and all parent, grandparent, great-grandparent, etc. applications of the Related Applications is incorporated herein by reference to the extent such subject matter is not inconsistent herewith.

For purposes of the USPTO extra-statutory requirements, the present application constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/434,475, entitled PRESENTATION OF SHARED THREAT INFORMATION IN A TRANSPORTATION-RELATED CONTEXT, filed 29 Mar. 2012, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/309,248, entitled AUDIBLE ASSISTANCE, filed 1 Dec. 2011, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/324,232, entitled VISUAL PRESENTATION OF SPEAKER-RELATED INFORMATION, filed 13 Dec. 2011, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/340,143, entitled LANGUAGE TRANSLATION BASED ON SPEAKER-RELATED INFORMATION, filed 29 Dec. 2011, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/356,419, entitled ENHANCED VOICE CONFERENCING, filed 23 Jan. 2012, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/362,823, entitled VEHICULAR THREAT DETECTION BASED ON AUDIO SIGNALS, filed 31 Jan. 2012, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/397,289, entitled ENHANCED VOICE CONFERENCING WITH HISTORY, filed 15 Feb. 2012, which is incorporated herein by reference in its entirety.

For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/407,570, entitled VEHICULAR THREAT DETECTION BASED ON IMAGE ANALYSIS, filed 28 Feb. 2012, which is incorporated herein by reference in its entirety.
For purposes of the USPTO extra-statutory requirements, U.S. patent application Ser. No. 13/434,475 constitutes a continuation-in-part and is entitled to the filing date of U.S. patent application Ser. No. 13/425,210, entitled DETERMINING THREATS BASED ON INFORMATION FROM ROAD-BASED DEVICES IN A TRANSPORTATION-RELATED CONTEXT, filed 20 Mar. 2012, which is incorporated herein by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 13434475 | Mar 2012 | US |
| Child | 14819237 | | US |
| Parent | 13309248 | Dec 2011 | US |
| Child | 13434475 | | US |
| Parent | 13324232 | Dec 2011 | US |
| Child | 13309248 | | US |
| Parent | 13340143 | Dec 2011 | US |
| Child | 13324232 | | US |
| Parent | 13356419 | Jan 2012 | US |
| Child | 13340143 | | US |
| Parent | 13362823 | Jan 2012 | US |
| Child | 13356419 | | US |
| Parent | 13397289 | Feb 2012 | US |
| Child | 13362823 | | US |
| Parent | 13407570 | Feb 2012 | US |
| Child | 13397289 | | US |
| Parent | 13425210 | Mar 2012 | US |
| Child | 13407570 | | US |