Augmented reality (AR) is a technology that superimposes virtual content on a user's view of the real world, thereby providing a composite view. A conventional AR device may convert speech to text, and the written transcription may be rendered in a predetermined portion of the display.
In some aspects, the techniques described herein relate to a method including: detecting an ambient sound from audio data captured by a plurality of microphones on a display device; determining, based on the audio data, a location of a sound source of the ambient sound; generating contextual information about the ambient sound based on an audio segment of the audio data that includes the ambient sound; and displaying, by the display device, the contextual information based on the location of the sound source.
In some aspects, the techniques described herein relate to a display device including: at least one processor; and a non-transitory computer-readable medium storing executable instructions that cause the at least one processor to: detect ambient noise and speech from audio data captured by a plurality of microphones on the display device; determine, based on the audio data, a first location of a first sound source of the ambient noise and a second location of a second sound source of the speech; generate first contextual information about the ambient noise based on a first audio segment of the audio data that includes the ambient noise; generate second contextual information about the speech based on a second audio segment of the audio data that includes the speech; display, by the display device, the first contextual information based on the first location of the first sound source; and display, by the display device, the second contextual information based on the second location of the second sound source.
In some aspects, the techniques described herein relate to a non-transitory computer-readable medium storing executable instructions that when executed by at least one processor cause the at least one processor to execute operations, the operations including: detecting an ambient sound from audio data captured by a plurality of microphones on a display device; determining, based on the audio data, a location of a sound source of the ambient sound; generating contextual information about the ambient sound based on an audio segment of the audio data that includes the ambient sound; and displaying, by the display device, the contextual information based on the location of the sound source.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
This disclosure relates to a technical solution for a display device with a contextual information display engine that displays contextual information about one or more detected ambient sounds. In some examples, the display device is a head-mounted display device such as an augmented reality (AR) device or a virtual reality (VR) device. The display device provides a technical solution for enabling perception-enhanced directional capabilities for ambient sounds, including determining a directionality of the ambient sound (e.g., there is a honking car to your right), labeling the source of the information (e.g., Matt is speaking), overlaying background audio sources into the user's experience (e.g., there is a child crying upstairs), and/or labeling attribute data (e.g., a 22-year-old woman is speaking on your right, saying [insert translation]). In some examples, the display device provides a technical effect for hearing-impaired users, such as converting auditory sounds into visual cues for the user.
The contextual information display engine may detect speech (e.g., a type of ambient sound) and display a label with attribute data about the speaker (e.g., a twelve-year-old girl is speaking) in a location that corresponds to a location of the speaker. In some examples, the contextual information display engine may display a textual transcription of the speech in a location that corresponds to the location of the speaker. In some examples, in addition to (or instead of) detecting speech, the contextual information display engine may detect ambient noise (e.g., a type of ambient sound), generate attribute data about the ambient noise (e.g., a car is honking), and display contextual information with the attribute data in a location that corresponds to a location of the sound source. The contextual information display engine may detect an ambient sound from audio data captured by a plurality of microphones, identify a location (e.g., a direction and/or distance) of a sound source of the ambient sound, generate attribute data about the ambient sound, and render contextual information (e.g., a label) with the attribute data in a location that is based on the location of the sound source.
The display device 100 includes a plurality of microphones 102. A microphone 102 may be a transducer that converts sound waves into electrical signals. In some examples, the microphones 102 are arranged in an array to form a set of beamforming microphones. In some examples, the number of microphones 102 is equal to or greater than four. The microphones 102 may generate audio data 104 (e.g., audio signals).
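By way of illustration only, the following sketch shows one way multi-channel audio data 104 could be captured from an array of microphones 102 using the Python sounddevice library; the channel count, sample rate, and window length are assumptions and are not required by this disclosure.

```python
# Illustrative sketch: capturing synchronized multi-channel audio from a
# microphone array. The channel count, sample rate, and window length are
# illustrative assumptions, not values prescribed by this disclosure.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000   # Hz, assumed
NUM_CHANNELS = 4       # assumed four-microphone array
DURATION_S = 1.0       # length of one capture window, seconds

def capture_audio_frame() -> np.ndarray:
    """Record one window of audio; returns an array of shape (samples, channels)."""
    frames = int(SAMPLE_RATE * DURATION_S)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE,
                   channels=NUM_CHANNELS, dtype="float32")
    sd.wait()  # block until the recording window completes
    return audio

if __name__ == "__main__":
    audio_data = capture_audio_frame()
    print(audio_data.shape)  # e.g., (16000, 4)
```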
The contextual information display engine 106 includes a sound processing engine 108 configured to receive the audio data 104 captured by the microphones 102. The sound processing engine 108 may detect whether the audio data 104 includes an ambient sound 110, and, if so, may derive an audio segment 112 that includes the ambient sound 110. The audio segment 112 is a portion of the audio data 104 that includes the ambient sound 110. In some examples, the sound processing engine 108 may filter and remove background noise from the audio segment 112. In some examples, the sound processing engine 108 may detect multiple ambient sounds 110 from the audio data 104. The ambient sounds 110 may include an ambient sound 110-1 and an ambient sound 110-2. The ambient sound 110-1 is generated by a sound source 142-1. The ambient sound 110-2 is generated by a sound source 142-2. In some examples, the sound source 142-2 is located in a different (separate) location from the sound source 142-1 (e.g., a spatially different location). In some examples, the sound source 142-1 and/or the sound source 142-2 are located in an image frame of the 3D scene 140 that is within the field of view of the camera device(s) 135 (e.g., currently viewed by the user). In some examples, the sound source 142-1 and/or the sound source 142-2 are located in an image frame of the 3D scene 140 that is outside the field of view of the camera device(s) 135. In some examples, the ambient sound 110-2 at least partially overlaps with the ambient sound 110-1 (e.g., temporally overlaps). The sound processing engine 108 may derive an audio segment 112-1 that includes the ambient sound 110-1, and an audio segment 112-2 that includes the ambient sound 110-2.
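By way of illustration only, the following sketch derives audio segments 112 from a mono channel of the audio data 104 using a simple short-time-energy threshold; the sound processing engine 108 is not limited to this heuristic, and the frame size and threshold are assumptions.

```python
# Illustrative sketch: deriving audio segments that contain an ambient sound
# using a short-time-energy threshold. Frame size and threshold are assumed.
import numpy as np

def derive_segments(mono: np.ndarray, sample_rate: int,
                    frame_ms: float = 32.0, threshold: float = 0.01):
    """Return (start_sample, end_sample) pairs where the signal exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(mono) // frame_len
    active = []
    for i in range(num_frames):
        frame = mono[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # short-time energy
        active.append(energy > threshold)

    # Merge consecutive active frames into contiguous segments.
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i * frame_len
        elif not is_active and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, num_frames * frame_len))
    return segments
```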
The sound processing engine 108 detects a type 116 of the ambient sound 110. A type 116 may be a classification of the ambient sound 110. The sound processing engine 108 may detect the type 116 as at least one of ambient noise 110a or speech 110b. Although
In some examples, the sound processing engine 108 detects the type 116 of the ambient sound 110-1 as ambient noise 110a. Ambient noise 110a may be noise that is not human speech such as a car honking, birds chirping, a police car alarm, etc. In some examples, the sound processing engine 108 detects the type 116 of the ambient sound 110-2 as speech 110b. Speech 110b may be human speech.
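By way of illustration only, the following sketch classifies an audio segment 112 as speech 110b or ambient noise 110a using a zero-crossing-rate heuristic; a practical implementation would more likely rely on a trained model (e.g., the ML model(s) 115), so the heuristic and its thresholds are purely illustrative assumptions.

```python
# Illustrative sketch: typing an audio segment as speech vs. ambient noise.
# A deployed system would typically use a trained model; the zero-crossing-rate
# heuristic and its thresholds below are only illustrative stand-ins.
import numpy as np

def classify_type(segment: np.ndarray) -> str:
    """Return "speech" or "ambient_noise" for a mono audio segment."""
    signs = np.sign(segment)
    signs[signs == 0] = 1
    zcr = float(np.mean(signs[:-1] != signs[1:]))  # zero-crossing rate per sample
    # Assumption: voiced speech tends toward a moderate zero-crossing rate,
    # while many broadband or tonal noises fall outside this band.
    return "speech" if 0.02 < zcr < 0.15 else "ambient_noise"
```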
The sound processing engine 108 may determine a location 114 of a sound source 142 of each detected ambient sound 110. For example, the sound processing engine 108 may determine a location 114-1 of sound source 142-1 based on the audio segment 112-1 that includes the ambient sound 110-1. The location 114 may be a direction 170 and/or a distance 172 of the sound source 142-1 from a point of reference (e.g., the user). The sound processing engine 108 may determine a location 114-2 of sound source 142-2 based on the audio segment 112-2 that includes the ambient sound 110-2. The sound processing engine 108 may include one or more machine-learning (ML) models 115.
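By way of illustration only, the following sketch estimates a direction 170 of a sound source 142 from the time difference of arrival between two of the microphones 102 using GCC-PHAT; the microphone spacing and speed of sound are assumptions, and a practical implementation may instead use the ML model(s) 115 or full-array beamforming.

```python
# Illustrative sketch: estimating a sound-source bearing from the time
# difference of arrival (TDOA) between two microphones using GCC-PHAT.
# Microphone spacing and speed of sound are illustrative assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature
MIC_SPACING = 0.08      # m, assumed distance between the two microphones

def gcc_phat(sig: np.ndarray, ref: np.ndarray, sample_rate: int) -> float:
    """Return the estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / sample_rate

def direction_of_arrival(left: np.ndarray, right: np.ndarray,
                         sample_rate: int) -> float:
    """Return the bearing in degrees (0 = broadside, positive toward the right mic)."""
    tdoa = gcc_phat(left, right, sample_rate)
    # Clamp to the physically valid range before taking the arcsine.
    ratio = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```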
The contextual information display engine 106 includes a label generator 120 configured to receive the output of the sound processing engine 108 and generate contextual information 130 for each detected ambient sound 110, where the contextual information 130 includes information about a detected ambient sound 110. The label generator 120 includes a sound source identifier 122 configured to generate attribute data 128 about an ambient sound 110 based on an audio segment 112. In some examples, as shown in
In some examples, the label generator 120 includes a speech-to-text engine 124 configured to generate a textual translation 188 of the speech 110b. In some examples, if the type 116 is speech 110b, the contextual information display engine 106 initiates the speech-to-text engine 124 to generate the textual translation 188 using the audio segment 112 as an input. In some examples, if the type 116 is speech 110b, the contextual information display engine 106 initiates the sound source identifier 122 to generate the attribute data 128 about the speech 110b. For example, the sound source identifier 122 generates attribute data 128 about the speech 110b based on the audio segment 112-2. If the ambient sound 110-2 is a child speaking, the attribute data 128 may be textual information that describes the speech 110b such as “a young child is speaking the following.” In some examples, if the type 116 is speech 110b, the contextual information display engine 106 initiates only the speech-to-text engine 124 and not the sound source identifier 122.
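By way of illustration only, the following sketch shows the routing described above, in which a segment typed as speech 110b is passed to a speech-to-text step and, optionally, to an attribute-generation step; the helper functions are simplified stand-ins, not APIs defined by this disclosure.

```python
# Illustrative sketch: routing a typed audio segment. The helpers below are
# simplified stand-ins, not APIs defined by this disclosure.
from dataclasses import dataclass
from typing import Optional
import numpy as np

def transcribe(segment: np.ndarray) -> str:
    # Stand-in: a real system would invoke a speech-to-text model here.
    return "<textual translation>"

def describe_source(segment: np.ndarray) -> str:
    # Stand-in: a real system would infer speaker attributes here.
    return "a young child is speaking the following"

def describe_noise(segment: np.ndarray) -> str:
    # Stand-in: a real system would classify the noise here.
    return "a car is honking"

@dataclass
class ContextualInfo:
    attribute_text: Optional[str]   # e.g., "a young child is speaking the following"
    transcription: Optional[str]    # textual translation of the speech, if any

def label_segment(segment: np.ndarray, sound_type: str,
                  speech_to_text_only: bool = False) -> ContextualInfo:
    if sound_type == "speech":
        text = transcribe(segment)
        attrs = None if speech_to_text_only else describe_source(segment)
        return ContextualInfo(attribute_text=attrs, transcription=text)
    # Ambient noise: attribute data only, no transcription.
    return ContextualInfo(attribute_text=describe_noise(segment), transcription=None)
```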
The label generator 120 may generate contextual information 130 for each detected ambient sound 110. For example, the label generator 120 may generate contextual information 130-1 for the ambient sound 110-1, and contextual information 130-2 for the ambient sound 110-2. As shown in
The contextual information display engine 106 may display the contextual information 130 at a location in the 3D scene 140 that is based on (or corresponds to) a location of a sound source 142. For example, the contextual information display engine 106 may display the contextual information 130-1 at a location in the 3D scene 140 that is based on (or corresponds to) a location 114-1 of a sound source 142-1, and the contextual information display engine 106 may display the contextual information 130-2 at a location in the 3D scene 140 that is based on (or corresponds to) a location 114-2 of a sound source 142-2.
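By way of illustration only, the following sketch maps a sound-source bearing to a horizontal display position at which the contextual information 130 could be anchored; the field of view and display resolution are assumptions, and sources outside the field of view are clamped to the nearest screen edge.

```python
# Illustrative sketch: anchoring a contextual-information label at a display
# position that corresponds to the sound source's direction. The horizontal
# field of view and display resolution are illustrative assumptions.
def label_anchor_x(bearing_deg: float, display_width_px: int = 1920,
                   horizontal_fov_deg: float = 90.0) -> int:
    """Map a bearing (0 = straight ahead, positive = right) to a pixel column.

    Sources outside the field of view are clamped to the screen edge so the
    label still hints at the direction of an off-screen sound source.
    """
    half_fov = horizontal_fov_deg / 2.0
    normalized = max(-1.0, min(1.0, bearing_deg / half_fov))  # range [-1, 1]
    return int(round((normalized + 1.0) / 2.0 * (display_width_px - 1)))

# Example: a honking car roughly 30 degrees to the user's right.
print(label_anchor_x(30.0))  # a column toward the right-hand side of the display
```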
The display device 100 may include one or more processors 101, one or more memory devices 103, and an operating system 105 configured to execute one or more applications 107. In some examples, the display device 100 is a head-mounted display device. In some examples, the display device 100 includes an AR device (e.g., a device that superimposes virtual content on a user's view of the real world, thereby providing a composite view). In some examples, the display device 100 includes smart glasses. In some examples, the display device 100 includes a VR device, where the camera device(s) 135 are world-facing cameras that capture the user's surroundings, and the VR device generates virtual content that corresponds to the imagery captured by the camera device(s) 135. In some examples, the display device 100 is another type of user device such as a smartphone, a laptop, desktop, gaming console, or a television device.
The processor(s) 101 may be formed in a substrate configured to execute one or more machine executable instructions or pieces of software, firmware, or a combination thereof. The processor(s) 101 can be semiconductor-based; that is, the processors can include semiconductor material that can perform digital logic. The memory device(s) 103 may include any type of storage device that stores information in a format that can be read and/or executed by the processor(s) 101. In some examples, the memory device(s) 103 is/are a non-transitory computer-readable medium. In some examples, the memory device(s) 103 includes a non-transitory computer-readable medium that includes executable instructions that cause at least one processor (e.g., the processor(s) 101) to execute operations discussed with reference to the display device 100. The applications 107 may be any type of computer program that can be executed by the display device 100, including native applications that are installed on the operating system 105 by the user and/or system applications that are pre-installed on the operating system 105.
Operation 402 includes detecting an ambient sound from audio data captured by a plurality of microphones on a display device. Operation 404 includes determining, based on the audio data, a location of a sound source of the ambient sound. Operation 406 includes generating contextual information about the ambient sound based on a segment of the audio data that includes the ambient sound. Operation 408 includes displaying, by the display device, the contextual information based on the location of the sound source.
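By way of illustration only, the following end-to-end sketch mirrors operations 402 through 408 with simplified stand-in helpers; it is not a definitive implementation of the engines described above.

```python
# Illustrative end-to-end sketch of operations 402-408: detect an ambient
# sound, locate its source, generate contextual information, and choose where
# to display it. All helpers are simplified stand-ins, not defined APIs.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class DisplayLabel:
    text: str            # contextual information to render
    bearing_deg: float   # direction used to place the label on the display

def detect_ambient_sound(audio: np.ndarray, threshold: float = 0.01) -> bool:
    # Operation 402 (stand-in): sufficient signal energy counts as a detected sound.
    return bool(np.mean(audio ** 2) > threshold)

def locate_source(audio: np.ndarray, sample_rate: int) -> float:
    # Operation 404 (stand-in): a real system would use TDOA/beamforming or an ML model.
    return 0.0  # degrees; 0 = straight ahead

def generate_contextual_info(audio: np.ndarray) -> str:
    # Operation 406 (stand-in): a real system would classify and describe the sound.
    return "an ambient sound was detected"

def run_pipeline(audio: np.ndarray, sample_rate: int) -> Optional[DisplayLabel]:
    if not detect_ambient_sound(audio):
        return None
    bearing = locate_source(audio, sample_rate)
    text = generate_contextual_info(audio)
    # Operation 408: the label is handed to the renderer with its placement hint.
    return DisplayLabel(text=text, bearing_deg=bearing)
```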
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In this specification and the appended claims, the singular forms “a,” “an” and “the” do not exclude the plural reference unless the context clearly dictates otherwise. Further, conjunctions such as “and,” “or,” and “and/or” are inclusive unless the context clearly dictates otherwise. For example, “A and/or B” includes A alone, B alone, and A with B. Further, connecting lines or connectors shown in the various figures presented are intended to represent example functional relationships and/or physical or logical couplings between the various elements. Many alternative or additional functional relationships, physical connections or logical connections may be present in a practical device. Moreover, no item or component is essential to the practice of the implementations disclosed herein unless the element is specifically described as “essential” or “critical”.
Terms such as, but not limited to, approximately, substantially, generally, etc. are used herein to indicate that a precise value or range thereof is not required and need not be specified. As used herein, the terms discussed above will have ready and instant meaning to one of ordinary skill in the art.
Moreover, use of terms such as up, down, top, bottom, side, end, front, back, etc. herein are used with reference to a currently considered or illustrated orientation. If they are considered with respect to another orientation, it should be understood that such terms must be correspondingly modified.
Although certain example methods, apparatuses and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. It is to be understood that terminology employed herein is for the purpose of describing particular aspects and is not intended to be limiting. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.