This relates generally to head-mounted devices, and, more particularly, to head-mounted devices with displays.
Some electronic devices such as head-mounted devices include video passthrough or see-through displays. A user may view a physical environment that includes text through the video passthrough or see-through displays.
It is within this context that the embodiments herein arise.
An electronic device may include one or more cameras, one or more displays, one or more processors, and memory storing instructions configured to be executed by the one or more processors, the instructions for obtaining a user input, determining an intent based on the user input, and in accordance with a determination that the user intent represents an intent to provide a summary of text in a physical environment, providing information regarding the text to a trained model, the information regarding the text obtained from one or more images of the physical environment captured using the one or more cameras, and presenting a text summary from the trained model on the one or more displays.
Head-mounted devices may display different types of extended reality content for a user. The head-mounted device may display a virtual object that is perceived at an apparent depth within the physical environment of the user. Virtual objects may sometimes be displayed at fixed locations relative to the physical environment of the user. For example, consider a user whose physical environment includes a table. A virtual object may be displayed for the user such that the virtual object appears to be resting on the table. As the user moves their head and otherwise interacts with the XR environment, the virtual object remains at the same, fixed position on the table (e.g., as if the virtual object were another physical object in the XR environment). This type of content may be referred to as world-locked content (because the position of the virtual object is fixed relative to the physical environment of the user).
Other virtual objects may be displayed at locations that are defined relative to the head-mounted device or a user of the head-mounted device. First, consider the example of virtual objects that are displayed at locations that are defined relative to the head-mounted device. As the head-mounted device moves (e.g., with the rotation of the user's head), the virtual object remains in a fixed position relative to the head-mounted device. For example, the virtual object may be displayed in the front and center of the head-mounted device (e.g., in the center of the device's or user's field-of-view) at a particular distance. As the user moves their head left and right, their view of their physical environment changes accordingly. However, the virtual object may remain fixed in the center of the device's or user's field of view at the particular distance as the user moves their head (assuming gaze direction remains constant). This type of content may be referred to as head-locked content. The head-locked content is fixed in a given position relative to the head-mounted device (and therefore the user's head which is supporting the head-mounted device). The head-locked content may not be adjusted based on a user's gaze direction. In other words, if the user's head position remains constant and their gaze is directed away from the head-locked content, the head-locked content will remain in the same apparent position.
Second, consider the example of virtual objects that are displayed at locations that are defined relative to a portion of the user of the head-mounted device (e.g., relative to the user's torso). This type of content may be referred to as body-locked content. For example, a virtual object may be displayed in front and to the left of a user's body (e.g., at a location defined by a distance and an angular offset from a forward-facing direction of the user's torso), regardless of which direction the user's head is facing. If the user's body is facing a first direction, the virtual object will be displayed in front and to the left of the user's body. While facing the first direction, the virtual object may remain at the same, fixed position relative to the user's body in the XR environment despite the user rotating their head left and right (to look towards and away from the virtual object). However, the virtual object may move within the device's or user's field of view in response to the user rotating their head. If the user turns around and their body faces a second direction that is the opposite of the first direction, the virtual object will be repositioned within the XR environment such that it is still displayed in front and to the left of the user's body. While facing the second direction, the virtual object may remain at the same, fixed position relative to the user's body in the XR environment despite the user rotating their head left and right (to look towards and away from the virtual object).
In the aforementioned example, body-locked content is displayed at a fixed position/orientation relative to the user's body even as the user's body rotates. For example, the virtual object may be displayed at a fixed distance in front of the user's body. If the user is facing north, the virtual object is in front of the user's body (to the north) by the fixed distance. If the user rotates and is facing south, the virtual object is in front of the user's body (to the south) by the fixed distance.
Alternatively, the distance offset between the body-locked content and the user may be fixed relative to the user whereas the orientation of the body-locked content may remain fixed relative to the physical environment. For example, the virtual object may be displayed in front of the user's body at a fixed distance from the user as the user faces north. If the user rotates and is facing south, the virtual object remains to the north of the user's body at the fixed distance from the user's body.
Body-locked content may also be configured to always remain gravity or horizon aligned, such that changes in the roll orientation of the user's head and/or body would not cause the body-locked content to move within the XR environment. Translational movement may cause the body-locked content to be repositioned within the XR environment to maintain the fixed distance from the user. Subsequent descriptions of body-locked content may include both of the aforementioned types of body-locked content.
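A minimal Python sketch of how these three reference frames could be resolved into a display anchor position is shown below; the function name, the yaw-only two-dimensional geometry, and the default offsets are illustrative assumptions rather than a description of any particular implementation.

```python
# Illustrative sketch: resolving a virtual object's position for world-locked,
# head-locked, and body-locked content (yaw-only, 2D, for brevity).
import math

def resolve_anchor(content_type, head_pos, head_yaw, body_yaw,
                   world_pos=(1.0, 0.0), offset_distance=1.0, offset_angle=0.0):
    """Return an (x, z) position for the virtual object.

    head_pos: the user's head position in world coordinates (used here for both
              head- and body-locked content as a simplification).
    head_yaw / body_yaw: facing directions in radians (0 = north).
    offset_angle: angular offset from the forward direction (e.g., front-left).
    """
    if content_type == "world-locked":
        # Fixed in the physical environment (e.g., resting on a table).
        return world_pos
    if content_type == "head-locked":
        # Fixed relative to the head-mounted device: follows head rotation.
        yaw = head_yaw + offset_angle
    elif content_type == "body-locked":
        # Fixed relative to the torso: ignores head rotation, follows the body.
        yaw = body_yaw + offset_angle
    else:
        raise ValueError(content_type)
    return (head_pos[0] + offset_distance * math.sin(yaw),
            head_pos[1] + offset_distance * math.cos(yaw))

# Head turned 90 degrees to the right while the body still faces north:
print(resolve_anchor("head-locked", (0.0, 0.0), math.pi / 2, 0.0))  # moves with the head
print(resolve_anchor("body-locked", (0.0, 0.0), math.pi / 2, 0.0))  # stays in front of the torso
```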
A schematic diagram of an illustrative head-mounted device is shown in
Head-mounted device 10 may include input-output circuitry 20. Input-output circuitry 20 may be used to allow data to be received by head-mounted device 10 from external equipment (e.g., a tethered computer, a portable device such as a handheld device or laptop computer, or other electrical equipment) and to allow a user to provide head-mounted device 10 with user input. Input-output circuitry 20 may also be used to gather information on the environment in which head-mounted device 10 is operating. Output components in circuitry 20 may allow head-mounted device 10 to provide a user with output and may be used to communicate with external electrical equipment.
As shown in
Display 32 may include one or more optical systems (e.g., lenses) that allow a viewer to view images on display(s) 16. A single display 32 may produce images for both eyes or a pair of displays 16 may be used to display images. In configurations with multiple displays (e.g., left and right eye displays), the focal length and positions of the lenses may be selected so that any gap present between the displays will not be visible to a user (e.g., so that the images of the left and right displays overlap or merge seamlessly). Display modules that generate different images for the left and right eyes of the user may be referred to as stereoscopic displays. The stereoscopic displays may be capable of presenting two-dimensional content (e.g., a user notification with text) and three-dimensional content (e.g., a simulation of a physical object such as a cube).
Input-output circuitry 20 may include various other input-output devices for gathering data and user input and for supplying a user with output. For example, input-output circuitry 20 may include one or more speakers 34 that are configured to play audio.
Input-output circuitry 20 may include one or more cameras 36. Cameras 36 may include one or more outward-facing cameras (that face the physical environment around the user when the electronic device is mounted on the user's head, as one example). Cameras 36 may capture visible light images, infrared images, or images of any other desired type. The cameras may be stereo cameras if desired. Outward-facing cameras may capture pass-through video for device 10. Cameras 36 may also include inward-facing cameras (e.g., for gaze detection).
Input-output circuitry 20 may include a gaze-tracker 40 (sometimes referred to as a gaze-tracking system or a gaze-tracking camera). The gaze-tracker 40 may be used to obtain gaze input from the user during operation of head-mounted device 10.
Gaze-tracker 40 may include a camera and/or other gaze-tracking system components (e.g., light sources that emit beams of light so that reflections of the beams from a user's eyes may be detected) to monitor the user's eyes. Gaze-tracker(s) 40 may face a user's eyes and may track a user's gaze. A camera in the gaze-tracking system may determine the location of a user's eyes (e.g., the centers of the user's pupils), may determine the direction in which the user's eyes are oriented (the direction of the user's gaze), and may determine the user's pupil size (e.g., so that light modulation and/or other optical parameters, the amount of gradualness with which one or more of these parameters is spatially adjusted, and/or the area in which one or more of these optical parameters is adjusted may be set based on the pupil size). The camera may also be used in monitoring the current focus of the lenses in the user's eyes (e.g., whether the user is focusing in the near field or far field, which may be used to assess whether a user is day dreaming or is thinking strategically or tactically) and in gathering other gaze information. Cameras in the gaze-tracking system may sometimes be referred to as inward-facing cameras, gaze-detection cameras, eye-tracking cameras, gaze-tracking cameras, or eye-monitoring cameras. If desired, other types of image sensors (e.g., infrared and/or visible light-emitting diodes and light detectors, etc.) may also be used in monitoring a user's gaze. The use of a gaze-detection camera in gaze-tracker 40 is merely illustrative.
As shown in
Using sensors 38, for example, control circuitry 14 can monitor the current direction in which a user's head is oriented relative to the surrounding environment (e.g., a user's head pose). In one example, position and motion sensors 38 may include one or more outward-facing cameras (e.g., that capture images of a physical environment surrounding the user). The outward-facing cameras may be used for face tracking (e.g., by capturing images of the user's jaw, mouth, etc. while the device is worn on the head of the user), body tracking (e.g., by capturing images of the user's torso, arms, hands, legs, etc. while the device is worn on the head of the user), and/or for localization (e.g., using visual odometry, visual inertial odometry, or other simultaneous localization and mapping (SLAM) techniques). In addition to being used for position and motion sensing, the outward-facing camera may capture pass-through video for device 10.
Input-output circuitry 20 may include one or more depth sensors 42. Each depth sensor may be a pixelated depth sensor (e.g., that is configured to measure multiple depths across the physical environment) or a point sensor (that is configured to measure a single depth in the physical environment). Camera images (e.g., from one of cameras 36) may also be used for monocular and/or stereo depth estimation. Each depth sensor (whether a pixelated depth sensor or a point sensor) may use phase detection (e.g., phase detection autofocus pixel(s)) or light detection and ranging (LIDAR) to measure depth. Any combination of depth sensors may be used to determine the depth of physical objects in the physical environment.
Input-output circuitry 20 may include a haptic output device 44. The haptic output device 44 may include actuators such as electromagnetic actuators, motors, piezoelectric actuators, electroactive polymer actuators, vibrators, linear actuators (e.g., linear resonant actuators), rotational actuators, actuators that bend bendable members, etc. The haptic output device 44 may be controlled to provide any desired pattern of vibrations.
Input-output circuitry 20 may also include other sensors and input-output components if desired (e.g., ambient light sensors, force sensors, temperature sensors, touch sensors, buttons, capacitive proximity sensors, light-based proximity sensors, other proximity sensors, strain gauges, gas sensors, pressure sensors, moisture sensors, magnetic sensors, microphones, light-emitting diodes, other light sources, wired and/or wireless communications circuitry, etc.).
Head-mounted device 10 may also include communication circuitry 56 to allow the head-mounted device to communicate with external equipment (e.g., a tethered computer, a portable electronic device, one or more external servers, or other electrical equipment). Communication circuitry 56 may be used for both wired and wireless communication with external equipment.
Communication circuitry 56 may include radio-frequency (RF) transceiver circuitry formed from one or more integrated circuits, power amplifier circuitry, low-noise input amplifiers, passive RF components, one or more antennas, transmission lines, and other circuitry for handling RF wireless signals. Wireless signals can also be sent using light (e.g., using infrared communications).
The radio-frequency transceiver circuitry in wireless communications circuitry 56 may handle wireless local area network (WLAN) communications bands such as the 2.4 GHz and 5 GHz Wi-Fi® (IEEE 802.11) bands, wireless personal area network (WPAN) communications bands such as the 2.4 GHz Bluetooth® communications band, cellular telephone communications bands such as a cellular low band (LB) (e.g., 600 to 960 MHz), a cellular low-midband (LMB) (e.g., 1400 to 1550 MHz), a cellular midband (MB) (e.g., from 1700 to 2200 MHz), a cellular high band (HB) (e.g., from 2300 to 2700 MHz), a cellular ultra-high band (UHB) (e.g., from 3300 to 5000 MHz), or other cellular communications bands between about 600 MHz and about 5000 MHz (e.g., 3G bands, 4G LTE bands, 5G New Radio Frequency Range 1 (FR1) bands below 10 GHz, etc.), a near-field communications (NFC) band (e.g., at 13.56 MHz), satellite navigation bands (e.g., an L1 global positioning system (GPS) band at 1575 MHz, an L5 GPS band at 1176 MHz, a Global Navigation Satellite System (GLONASS) band, a BeiDou Navigation Satellite System (BDS) band, etc.), ultra-wideband (UWB) communications band(s) supported by the IEEE 802.15.4 protocol and/or other UWB communications protocols (e.g., a first UWB communications band at 6.5 GHz and/or a second UWB communications band at 8.0 GHz), and/or any other desired communications bands.
The radio-frequency transceiver circuitry may include millimeter/centimeter wave transceiver circuitry that supports communications at frequencies between about 10 GHz and 300 GHz. For example, the millimeter/centimeter wave transceiver circuitry may support communications in Extremely High Frequency (EHF) or millimeter wave communications bands between about 30 GHz and 300 GHz and/or in centimeter wave communications bands between about 10 GHz and 30 GHz (sometimes referred to as Super High Frequency (SHF) bands). As examples, the millimeter/centimeter wave transceiver circuitry may support communications in an IEEE K communications band between about 18 GHz and 27 GHz, a Ka communications band between about 26.5 GHz and 40 GHz, a Ku communications band between about 12 GHz and 18 GHz, a V communications band between about 40 GHz and 75 GHz, a W communications band between about 75 GHz and 110 GHz, or any other desired frequency band between approximately 10 GHz and 300 GHz. If desired, the millimeter/centimeter wave transceiver circuitry may support IEEE 802.11ad communications at 60 GHz (e.g., WiGig or 60 GHz Wi-Fi bands around 57-61 GHz), and/or 5th generation mobile networks or 5th generation wireless systems (5G) New Radio (NR) Frequency Range 2 (FR2) communications bands between about 24 GHz and 90 GHz.
Antennas in wireless communications circuitry 56 may include antennas with resonating elements that are formed from loop antenna structures, patch antenna structures, inverted-F antenna structures, slot antenna structures, planar inverted-F antenna structures, helical antenna structures, dipole antenna structures, monopole antenna structures, hybrids of these designs, etc. Different types of antennas may be used for different bands and combinations of bands. For example, one type of antenna may be used in forming a local wireless link and another type of antenna may be used in forming a remote wireless link.
During operation, head-mounted device 10 may use communication circuitry 56 to communicate with one or more external servers 60 through network(s) 58. Examples of communication network(s) 58 include local area networks (LAN) and wide area networks (WAN) (e.g., the Internet). Communication network(s) 58 may be implemented using any known network protocol, including various wired or wireless protocols, such as, for example, Ethernet, Universal Serial Bus (USB), FIREWIRE, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
External server(s) 60 may be implemented on one or more standalone data processing apparatus or a distributed network of computers. External server 60 may include a trained model 62 such as a large language model (LLM). The trained model may provide a text summary to head-mounted device 10 in response to information from head-mounted device 10.
Consider an example in which the user of head-mounted device 10 is reading a book while operating head-mounted device 10. A page of the book including text may be visible to the viewer through one or more see-through displays 32 in head-mounted device 10. The user may identify a paragraph in the book by pointing at the paragraph or using other forms of user input. In response, head-mounted device 10 may perform optical character recognition (OCR) to obtain machine-readable text for the paragraph. The head-mounted device 10 may provide the machine-readable text to trained model 62. The trained model 62 may summarize the paragraph and send the text summary to head-mounted device 10. The text summary may be presented by head-mounted device 10 (e.g., adjacent to the paragraph of text in the book). In this way, head-mounted device 10 may display text summaries of text in the user's physical environment using information from trained model 62.
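A compact sketch of this flow is shown below. The callables `recognize_text`, `summarize`, and `display_summary` are hypothetical placeholders standing in for the device's optical character recognition, the trained model, and the display pipeline, respectively; they are assumptions made for illustration only.

```python
# Illustrative sketch of the book-reading example: the user identifies a
# paragraph, the device performs OCR on it, a trained model summarizes the
# machine-readable text, and the summary is presented next to the paragraph.

def summarize_selected_paragraph(camera_frame, paragraph_box,
                                 recognize_text, summarize, display_summary):
    """camera_frame: image from an outward-facing camera.
    paragraph_box: bounding box of the paragraph the user identified.
    The three callables stand in for OCR, the trained model, and display."""
    machine_readable = recognize_text(camera_frame, paragraph_box)  # OCR step
    summary = summarize(machine_readable)                           # trained model (e.g., an LLM)
    display_summary(summary, anchor=paragraph_box)                  # present adjacent to the paragraph

# Toy usage with stub callables:
summarize_selected_paragraph(
    camera_frame=None,
    paragraph_box=(100, 200, 400, 320),
    recognize_text=lambda frame, box: "Example paragraph text...",
    summarize=lambda text: "One-sentence summary of the paragraph.",
    display_summary=lambda s, anchor: print(f"Show near {anchor}: {s}"),
)
```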
The example of communicating with external server(s) 60 to obtain a text summary from trained model 62 is merely illustrative. If desired, head-mounted device 10 may summarize text using a trained model that is stored on the head-mounted device 10 (e.g., in memory of control circuitry 14).
Trained model 62 may be a large language model. A large language model (LLM) is an artificial intelligence system designed to understand and generate human language text. Large language models (LLMs) belong to the broader category of natural language processing (NLP) models and have the ability to process and generate text that is coherent, contextually relevant, and grammatically accurate. Large language models may be built using deep learning techniques (e.g., neural networks) which enable them to learn patterns and associations within vast amounts of textual data. LLMs are trained on datasets containing billions of words, which allows them to capture the nuances and complexities of human language. Large language models are characterized by their large scale (e.g., having at least one hundred million parameters, at least one billion parameters, at least ten billion parameters, at least one hundred billion parameters, etc.).
Trained model 62 may therefore sometimes be referred to as artificial intelligence (AI) system 62, language model 62, large language model 62, natural language processing model 62, etc. Trained model 62 may be configured to output human language text in response to input. Trained model 62 may include at least one billion parameters, at least ten billion parameters, at least one hundred billion parameters, etc.
It may be desirable to present a text summary that summarizes some or all of the text on physical object 72. Presenting the text summary may allow the user of head-mounted device 10 to quickly absorb the information in the text. In the example of
There are numerous ways to trigger the display of a text summary for some or all text that is viewable in a user's physical environment. The user may provide user input to one or more user input components to place the head-mounted device in an assisted reading mode. In one possible arrangement, the head-mounted device may automatically present text summaries for text in the user's physical environment while in the assisted reading mode. In other words, a text summary may be presented for text in the physical environment whenever the text is detected in images captured by camera(s) 36.
Alternatively, when the head-mounted device is in the assisted reading mode, the user may provide user input to trigger the presentation of a text summary. In addition to triggering the presentation of the text summary, the user input may select a particular subset of text in the physical environment to summarize (e.g., the user may select the first paragraph in
The user input may include a button press. For example, the user may press a button on head-mounted device 10 to trigger a text summary to be displayed. The user may press the button again to trigger a text summary of a subsequent paragraph to be displayed. For example, the user may place the head-mounted device in the assisted reading mode. Subsequently, pressing the button a first time may cause a text summary of the first paragraph on physical object 72 to be displayed. Pressing the button a second time may cause a text summary of the second paragraph on physical object 72 to be displayed.
The user input may include a voice command. A microphone in head-mounted device 10 may be used to detect the voice command. The voice command may include commands such as “summarize,” “summarize the first paragraph,” “summarize the next paragraph,” “next paragraph,” etc.
The user input may include a head gesture. The head gesture may include, as examples, an upward head tilt, a rightward head tilt, a nod, etc. The head gesture may trigger a text summary to be displayed. A head gesture may trigger a text summary of a subsequent paragraph to be displayed. For example, the user may place the head-mounted device in the assisted reading mode. Subsequently, an upward head tilt may cause a text summary of the first paragraph on physical object 72 to be displayed. A downward head tilt may cause a text summary of the paragraph below the currently summarized paragraph (e.g., the second paragraph on physical object 72) to be displayed. In examples where there are paragraphs side-by-side, a rightward head tilt may cause a text summary of the paragraph to the right of the currently summarized paragraph on physical object 72 to be displayed and a leftward head tilt may cause a text summary of the paragraph to the left of the currently summarized paragraph on physical object 72 to be displayed.
The user input may include a gaze gesture. The gaze gesture may include looking at a particular paragraph for longer than a threshold dwell time. For example, the user may place the head-mounted device in the assisted reading mode. Subsequently, the user may gaze at the first paragraph on physical object 72 for longer than the threshold dwell time, causing a text summary of the first paragraph to be displayed. Subsequently, the user may gaze at the second paragraph on physical object 72 for longer than the threshold dwell time, causing a text summary of the second paragraph to be displayed.
The user input may include a hand gesture such as a finger point. For example, the user may place the head-mounted device in the assisted reading mode. Subsequently, the user may point at the first paragraph on physical object 72 (e.g., at location 76 in
In addition to selecting a subset of text using user input, the user may adjust the length of text being summarized using user input. In the aforementioned examples, one paragraph of text in the physical environment is summarized at a time. The user may adjust the size of the text being summarized from one paragraph of text to one page of text, two paragraphs of text, a given number of lines of text, etc.
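One way the various trigger inputs described above could be mapped to summarization actions is sketched below. The event structure, the dwell threshold, and the linear paragraph indexing are simplifying assumptions for illustration and are not tied to any particular implementation.

```python
# Illustrative sketch: mapping user inputs (button press, voice command, head
# gesture, gaze dwell, finger point) to a summarization target while the device
# is in an assisted reading mode.

DWELL_THRESHOLD_S = 1.5  # assumed gaze dwell threshold, in seconds

def interpret_input(event, current_paragraph, granularity="paragraph"):
    """Return (target_paragraph, granularity) to summarize next, or None."""
    kind = event.get("type")
    if kind == "button_press":
        return current_paragraph + 1, granularity           # advance to the next paragraph
    if kind == "voice" and "summarize" in event.get("phrase", "").lower():
        return event.get("paragraph", current_paragraph), granularity
    if kind == "head_tilt":
        # Simplification: treat paragraphs as a linear sequence.
        step = {"up": 0, "down": 1, "right": 1, "left": -1}.get(event.get("direction"), 0)
        return current_paragraph + step, granularity
    if kind == "gaze_dwell" and event.get("duration", 0.0) > DWELL_THRESHOLD_S:
        return event["paragraph"], granularity               # paragraph the user is looking at
    if kind == "finger_point":
        return event["paragraph"], granularity               # paragraph the user is pointing at
    if kind == "set_granularity":
        return current_paragraph, event["granularity"]       # e.g., "paragraph", "page", "5 lines"
    return None                                              # no summarization intent detected

print(interpret_input({"type": "gaze_dwell", "paragraph": 2, "duration": 2.0}, 0))
```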
A visual indicator may be presented on display 32 to identify which subset of text in the physical environment is being summarized.
The example above of head-mounted device 10 being switched into an assisted reading mode is merely illustrative. In some cases, the head-mounted device 10 may be operable in a normal mode in addition to the assisted reading mode. Head-mounted device 10 may still present text summaries while in the normal mode. For example, an additional user input may be required to cause a text summary to be presented when the head-mounted device is in the normal mode. Requiring the additional user input in the normal mode may avoid false positives in which text summaries are presented when not desired by the user. In the assisted reading mode, the number of user inputs required to cause a text summary to be presented may be lower than in the normal mode and/or the type of user inputs required to cause a text summary to be presented may be different (e.g., easier) than in the normal mode.
Some head-mounted devices may not have a dedicated assisted reading mode. Head-mounted device 10 may still generate text summaries in response to one or more user inputs (even when the device does not have a dedicated assisted reading mode). In this type of embodiment, the number and type of user inputs required to cause a text summary to be presented may be the same during all operation of head-mounted device 10.
In the example of
A wide range of information may be provided from head-mounted device 10 to trained model 62 in response to detecting a user intent to provide a summary of text. As shown in
As an example, in response to detecting a user intent to provide a summary of text, head-mounted device 10 may perform optical character recognition (OCR) on the text using images of the text captured using camera(s) 36. The OCR produces a machine-readable version of the text that is then provided to trained model 62.
As another example, in response to detecting a user intent to provide a summary of text, head-mounted device 10 may capture one or more images of the text using camera(s) 36. The one or more images may be provided directly to trained model 62. The trained model 62 may be capable of performing OCR on the text to obtain a machine-readable version of the text.
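The two alternatives above could be expressed as a single payload-building step, as in the sketch below; the payload keys and the optional `run_ocr` callable are assumptions made for illustration.

```python
# Illustrative sketch: building the "information regarding the text" for the
# trained model, either as machine-readable text from device-side OCR or as
# the raw captured image(s) for the model to process itself.

def build_text_payload(images, run_ocr=None):
    """images: one or more camera images containing the selected text.
    run_ocr: optional device-side OCR callable; if None, the image(s) are sent
             and OCR is left to the trained model."""
    if run_ocr is not None:
        return {"text": run_ocr(images)}   # option 1: OCR on the head-mounted device
    return {"images": images}              # option 2: send the captured image(s) directly

print(build_text_payload(["frame_0"], run_ocr=lambda imgs: "Recognized paragraph text"))
print(build_text_payload(["frame_0"]))
```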
In addition to information regarding the text, head-mounted device 10 may provide contextual information to trained model 62. The contextual information may be used by trained model 62 to tailor the summary text. The contextual information may include information such as location information, cultural information, reading level information, age information, subject matter knowledge information, education information, historical query information, user preference information, temporal information, calendar information, etc.
The location information may include a city of residence of the user of head-mounted device 10, the current city in which head-mounted device 10 is located, etc. With all else being equal, the trained model 62 may provide a different text summary to a first user in a first location than a second user in a second location or may provide a different text summary to a first user with a first city of residence than a second user with a second, different city of residence. Current location information that is included in the location information provided to trained model 62 may be obtained using a global positioning system (GPS) sensor in head-mounted device 10, may be provided to head-mounted device 10 directly by the user, etc.
The cultural information may include a cultural background of the user of head-mounted device 10, cultural information associated with the current location of head-mounted device 10, etc. For example, for given text a user with a first cultural background may receive a first text summary that references a cultural custom associated with the first cultural background. A second user with a second cultural background that does not include the cultural custom may receive a second text summary for the given text that does not reference the cultural custom.
The reading level information may include an approximation of the reading level of the user of the head-mounted device. The complexity of the text summary output by trained model 62 may be adjusted based on the reading level of the user of the head-mounted device. For example, a first user with a fifth grade reading level may receive a first text summary for given text whereas a second user with an eighth grade reading level may receive a second text summary for the given text that is more complex than the first text summary.
The age information may include the age of the user of the head-mounted device. The complexity of the text summary output by trained model 62 may be adjusted based on the age of the user of the head-mounted device. For example, a first user of a first age may receive a first text summary for given text whereas a second user of a second age that is greater than the first age may receive a second text summary for the given text that is more complex than the first text summary.
The subject matter knowledge information may include details of particular subjects for which the user has a high or low knowledge base. Consider the example where the user of the head-mounted device 10 has an in-depth knowledge of biology and no prior exposure to computer science. This information may be identified in the subject matter knowledge information. In this example, text summaries output by trained model 62 related to biology may include more technical terms/details than if the user were not an expert in biology. Similarly, text summaries output by trained model 62 related to computer science may include fewer technical terms/details than if the user had prior exposure to computer science. A first user without expertise in a given subject matter may receive a first text summary for given text associated with the given subject matter whereas a second user with expertise in the given subject matter may receive a second text summary for the given text that is more complex than the first text summary.
The education information may include details of the user's educational history. Consider the example where the user of the head-mounted device 10 is a college graduate with a degree in biology. This information may be identified in the education information. In this example, text summaries output by trained model 62 related to biology may include more technical terms/details than if the user did not have a degree in biology.
The historical query information may include information regarding prior questions asked by the user and/or prior text that has been summarized by the user. For example, a user may request summaries of multiple paragraphs regarding a given historical event. Subsequently, the user may request a summary of a new paragraph. The trained model may receive historical query information identifying the previous summary requests associated with the given historical event. In view of this contextual information, the trained model may be more likely to reference the given historical event in a text summary of the new paragraph.
The user preference information may include any desired user preferences regarding the tone, length, and/or type of summary provided by trained model 62. The user preferences may be provided to head-mounted device 10 using one or more input components.
The temporal information may include information regarding the current time of day or the current time of year. With all else equal, the trained model may output a first text summary for given text at a first time of day and may output a second, different text summary for the given text at a second, different time of day. With all else equal, the trained model may output a first text summary for given text at a first time of year and may output a second, different text summary for the given text at a second, different time of year.
The calendar information may include information from the user's calendar such as upcoming appointments, travel, etc. For example, a user may be reading a guide book for a given location. If the calendar information identifies an upcoming vacation to the given location, the text summary output by the trained model may be different than if there was no upcoming vacation to the given location. For example, a user may be reading a textbook for a given subject. If the calendar information identifies an upcoming class for the given subject, the text summary output by the trained model may be different than if there was no upcoming class for the given subject.
It is noted that trained model 62 may infer additional contextual information based on the received pieces of contextual information. For example, the trained model 62 may infer reading level information and subject matter knowledge information based on received education information.
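For illustration, the contextual information listed above could be packaged as an optional-field record such as the one sketched below; the field names and example values are assumptions, and only the fields that are actually available (and that the user has chosen to share) would be transmitted.

```python
# Illustrative sketch: packaging contextual information for the trained model.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ContextInfo:
    location: Optional[str] = None            # e.g., current city
    cultural_background: Optional[str] = None
    reading_level: Optional[str] = None       # e.g., "fifth grade"
    age: Optional[int] = None
    subject_knowledge: Optional[dict] = None  # e.g., {"biology": "expert"}
    education: Optional[str] = None           # e.g., "degree in biology"
    historical_queries: Optional[list] = None
    user_preferences: Optional[dict] = None   # tone, length, summary type
    temporal: Optional[str] = None            # time of day / time of year
    calendar_events: Optional[list] = None    # e.g., upcoming trip or class

    def to_payload(self):
        # Drop unset fields so only available, consented-to context is sent.
        return {key: value for key, value in asdict(self).items() if value is not None}

context = ContextInfo(reading_level="eighth grade", subject_knowledge={"biology": "expert"})
print(context.to_payload())
```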
In addition to information regarding the text, head-mounted device 10 may provide one or more user questions to trained model 62. The user questions may include questions provided directly from the user to head-mounted device 10 using one or more input devices. For example, the user may ask questions that are detected using a microphone in head-mounted device 10. Instead or in addition, the user may provide a question using a keyboard (e.g., a physical keyboard associated with a computer, a touch-sensitive keyboard on a touch-sensitive display, a virtual keyboard, etc.). The question may address the paragraph being summarized. For example, the user might ask “what are the three most important points from this paragraph?” or “what argument is this paragraph making?” Trained model 62 may tailor the summary text based on the user question(s) received from head-mounted device 10.
In addition to information regarding the text, head-mounted device 10 may provide one or more response length parameters to trained model 62. The response length parameters may be used to control the length of the summary text provided by trained model 62. The response length parameters may include an absolute maximum, an absolute minimum, a relative maximum, and/or a relative minimum. The absolute maximum may be an absolute maximum word count for the text summary or an absolute maximum character count for the text summary. The relative maximum may be a relative maximum word count for the text summary or a relative maximum character count for the text summary. The absolute minimum may be an absolute minimum word count for the text summary or an absolute minimum character count for the text summary. The relative minimum may be a relative minimum word count for the text summary or a relative minimum character count for the text summary.
As one example, the response length parameters may include an absolute maximum of 100 words. In this case, the summary text output by trained model 62 may include a maximum of 100 words.
As another example, the response length parameters may include an absolute minimum of 25 words. In this case, the summary text output by trained model 62 may include a minimum of 25 words.
As another example, the response length parameters may include an absolute maximum of 500 characters. In this case, the summary text output by trained model 62 may include a maximum of 500 characters.
As another example, the response length parameters may include a relative maximum word count of 25%. In other words, the text summary may have a number of words that is no greater than 25% the number of words in the text being summarized. Consider an example where the text being summarized has 500 words. When the relative maximum word count is 25%, the summary text output by trained model 62 may include a maximum of 125 words. When the relative maximum word count is 10%, the summary text output by trained model 62 may include a maximum of 50 words.
As another example, the response length parameters may include a relative minimum word count of 5%. In other words, the text summary may have a number of words that is no less than 5% the number of words in the text being summarized. Consider an example where the text being summarized has 500 words. When the relative minimum word count is 5%, the summary text output by trained model 62 may include a minimum of 25 words. When the relative minimum word count is 10%, the summary text output by trained model 62 may include a minimum of 50 words.
As another example, the response length parameters may include a relative maximum character count of 25%. In other words, the text summary may have a number of characters that is no greater than 25% the number of characters in the text being summarized. Consider an example where the text being summarized has 2,500 characters. When the relative maximum character count is 20%, the summary text output by trained model 62 may include a maximum of 500 characters. When the relative maximum character count is 10%, the summary text output by trained model 62 may include a maximum of 250 characters.
Any of the aforementioned response length parameters may be inferred by the system and/or selected by the user. For example, contextual information may be used by control circuitry 14 to select or adjust one or more response length parameters. Instead or in addition, a user may provide user input to adjust one or more response length parameters.
In examples where multiple response length parameters are provided to trained model 62, the trained model may ensure that the summary text output to head-mounted device 10 is compliant with all of the response length parameters (e.g., adheres to an absolute maximum, a relative maximum, and an absolute minimum).
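The arithmetic for combining such parameters is sketched below: relative limits are converted to word counts using the length of the text being summarized, and when several maxima or minima apply, the tightest one governs. The function name and the fraction-based representation of relative limits are illustrative assumptions.

```python
# Illustrative sketch: resolving response length parameters into one effective
# word range (the same approach could be applied to character counts).

def effective_word_bounds(source_word_count, absolute_max=None, absolute_min=None,
                          relative_max=None, relative_min=None):
    """relative_max / relative_min are fractions of the source word count."""
    maxima = [m for m in (absolute_max,
                          int(source_word_count * relative_max) if relative_max else None)
              if m is not None]
    minima = [m for m in (absolute_min,
                          int(source_word_count * relative_min) if relative_min else None)
              if m is not None]
    max_words = min(maxima) if maxima else None   # tightest maximum governs
    min_words = max(minima) if minima else None   # tightest minimum governs
    return min_words, max_words

# Matches the examples above: 500 source words, 25% relative maximum, 5% relative minimum.
print(effective_word_bounds(500, relative_max=0.25, relative_min=0.05))  # (25, 125)
# An absolute maximum of 100 words combined with a 25% relative maximum: 100 governs.
print(effective_word_bounds(500, absolute_max=100, relative_max=0.25))   # (None, 100)
```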
As one example, the response length parameters may be adjusted by the user using one or more user input devices. As one specific example, the user may adjust a slider to adjust the response length parameters. The slider may have one extreme at which the summary is capped at a small size (e.g., an absolute maximum of 50 words) and one extreme at which the summary is capped at a large size (e.g., an absolute maximum of 300 words). The slider may be adjusted between the two extremes to adjust the absolute maximum of the summary text.
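The slider-to-length mapping could be as simple as a linear interpolation between the two extremes, as in the sketch below; the 50-word and 300-word caps are taken from the example above, and the function name is hypothetical.

```python
# Illustrative sketch: mapping a summary-length slider to an absolute maximum word count.

def slider_to_max_words(slider_value, short_cap=50, long_cap=300):
    """slider_value in [0.0, 1.0]; 0.0 = shortest summaries, 1.0 = longest."""
    slider_value = min(max(slider_value, 0.0), 1.0)   # clamp to the valid range
    return round(short_cap + slider_value * (long_cap - short_cap))

print(slider_to_max_words(0.0))   # 50
print(slider_to_max_words(0.5))   # 175
print(slider_to_max_words(1.0))   # 300
```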
It should be understood that the summary text output by trained model 62 is not a definition. Instead the summary text captures the main points, key ideas, and important details of the text being summarized. The summary text may usually be shorter than the text it is summarizing. However, if the response length parameter(s) allow, the summary text may sometimes be longer than the text being summarized. As an example, a short poem may be summarized using summary text that has more words and/or characters than the poem.
It should be understood that the summary text output by trained model 62 is not a translation. The summary text output by trained model 62 is the same language as the text being summarized. As examples, English text is summarized using English summary text, Spanish text is summarized using Spanish summary text, etc. If desired, a summary translation may be presented instead of summary text. As an example, Spanish text may be summarized using an English summary translation.
After receiving summary text from trained model 62, head-mounted device 10 may immediately present the summary text (e.g., as shown in
As a specific example, the speaker may play a chime in response to receiving the summary text. Using spatial audio, the chime may have an associated location relative to the user. The head-mounted device may present the summary text in response to the user looking in the direction of the chime.
As another specific example, the display may present a visual indicator in the periphery of the user's field of view in response to receiving the summary text. The head-mounted device may present the summary text in response to the user looking in the direction of the visual indicator.
As another specific example, the haptic output device may provide haptic output in response to receiving the summary text. The head-mounted device may present the summary text in response to the user pressing a button after the haptic notification is presented.
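The spatialized-chime example could be implemented with a simple angular test, as sketched below; the 15-degree tolerance and the yaw-only treatment of gaze direction are assumptions for illustration.

```python
# Illustrative sketch: presenting the summary only after the user looks toward
# the direction from which a spatialized notification chime was played.
import math

LOOK_TOLERANCE_RAD = math.radians(15)  # assumed angular tolerance

def should_present(chime_direction_rad, gaze_direction_rad):
    """Return True once the user's gaze is roughly aligned with the chime direction."""
    difference = (gaze_direction_rad - chime_direction_rad + math.pi) % (2 * math.pi) - math.pi
    return abs(difference) <= LOOK_TOLERANCE_RAD

# Chime spatialized 45 degrees to the user's right:
print(should_present(math.radians(45), math.radians(0)))   # False: still looking straight ahead
print(should_present(math.radians(45), math.radians(50)))  # True: user looked toward the chime
```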
During the operations of block 104, head-mounted device 10 (e.g., control circuitry 14) may determine an intent based on the user input. The intent may be, for example, an intent to provide a summary of text in a physical environment.
During the operations of block 106, in accordance with a determination that the user intent represents an intent to provide a summary of text in a physical environment, head-mounted device 10 (e.g., control circuitry 14) may provide information regarding the text to a trained model. As previously discussed, the trained model may be a large language model (LLM) with at least one billion parameters.
Providing information regarding the text to the trained model may include, as shown by the operations of block 108, performing optical character recognition (OCR) on the text using one or more images captured by camera(s) 36. After performing the OCR on the text to generate machine-readable text at block 108, the machine-readable text may be provided to the trained model.
Providing information regarding the text to the trained model may include, as shown by the operations of block 110, providing one or more images captured by camera(s) 36 to the trained model. The trained model may use the one or more images for OCR operations to generate machine-readable text that is then used to generate a text summary.
The operations of block 106 may also include selecting a subset of the text in the physical environment to provide to the trained model, as shown in the operations of block 112. The subset of text may be selected based on standing user preferences regarding the length of text to be summarized. For example, the user may have set a preference to summarize one paragraph of text, one page of text, two paragraphs of text, a given number of lines of text, etc.
The user may also provide user input to indicate which portion of the text is intended for the text summary. The user may indicate the desired subset of text using a gaze gesture (e.g., dwelling on a particular portion of the text), a head gesture, a voice command, a hand gesture (e.g., pointing a finger at a particular portion of the text), and/or a button press.
During the operations of block 114, the head-mounted device 10 (e.g., control circuitry 14) may present an indicator on the one or more displays 32 that identifies the subset of the text being summarized. An example of a visual indicator (e.g., a bracket) is shown in
During the operations of block 116, the head-mounted device 10 (e.g., control circuitry 14) may provide contextual information to the trained model. The contextual information may include information such as location information, cultural information, reading level information, age information, subject matter knowledge information, education information, historical query information, user preference information, temporal information, calendar information, etc.
During the operations of block 118, the head-mounted device 10 (e.g., control circuitry 14) may provide one or more response length parameters to the trained model. The response length parameters may be used to control the length of the summary text provided by trained model 62. The response length parameters may include an absolute maximum, a relative maximum, an absolute minimum, and/or a relative minimum. The response length parameters may be adjusted or selected based on contextual information such as location information, cultural information, reading level information, age information, subject matter knowledge information, education information, historical query information, user preference information, temporal information, calendar information, etc. Instead or in addition, the response length parameters may be adjusted at any time based on user input to one or more input components in head-mounted device 10. If the user adjusts the response length parameter(s) while a text summary is being presented, the text summary may be updated to reflect the new response length parameter(s).
During the operations of block 120, the head-mounted device 10 (e.g., control circuitry 14) may provide a user question to the trained model. The user question may be provided directly from the user to head-mounted device 10 using one or more input devices. For example, the user may ask questions that are detected using a microphone in head-mounted device 10. Instead or in addition, the user may provide a question using a keyboard (e.g., a physical keyboard associated with a computer, a touch-sensitive keyboard on a touch-sensitive display, a virtual keyboard, etc.).
During the operations of block 122, the head-mounted device 10 (e.g., control circuitry 14) may present a notification in accordance with receiving a text summary from the trained model. The notification may include a visual notification presented using display 32, an audio notification presented using speaker 34, and/or a haptic notification presented using haptic output device 44.
During the operations of block 124, the head-mounted device 10 (e.g., control circuitry 14) may present a text summary from the trained model on the one or more displays 32. The text summary may be world-locked, body-locked, or head-locked. The text summary may be shorter (e.g., in word count and/or character count) than the text being summarized. The text summary may be the same language as the text being summarized.
Presenting the text summary may include selecting a position for the text summary on the one or more displays 32 based on the one or more images of the physical environment captured by camera(s) 36. In one possible arrangement (shown in
Presenting the text summary during the operations of block 124 may include presenting the text summary after the notification is presented at block 122 and in response to additional user input confirming the user intent to present the text summary.
At block 124, instead of or in addition to displaying the text summary, the text summary may be audibly presented to the user by speaker 34. In other words, the text summary may be played aloud by speaker 34 in parallel with presenting the text summary on display 32 or instead of presenting the text summary on display 32.
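One way to select a non-overlapping position of the kind described above for block 124 is sketched below; the candidate offsets, margin, and display size are illustrative assumptions, and boxes are (x, y, width, height) in display coordinates.

```python
# Illustrative sketch: choosing a display position for the summary that does
# not overlap the text regions detected in the camera image.

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_summary(summary_size, text_boxes, display_size=(1920, 1080)):
    """Try positions adjacent to the detected text; fall back to a display corner."""
    sw, sh = summary_size
    dw, dh = display_size
    margin = 20
    for tx, ty, tw, th in text_boxes:
        for cx, cy in ((tx + tw + margin, ty),    # to the right of the text
                       (tx - sw - margin, ty),    # to the left of the text
                       (tx, ty + th + margin)):   # below the text
            candidate = (cx, cy, sw, sh)
            on_screen = cx >= 0 and cy >= 0 and cx + sw <= dw and cy + sh <= dh
            if on_screen and not any(overlaps(candidate, box) for box in text_boxes):
                return (cx, cy)
    return (dw - sw - margin, dh - sh - margin)   # fallback: bottom-right corner

print(place_summary((400, 200), [(100, 100, 600, 500)]))  # e.g., (720, 100): right of the paragraph
```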
It is noted that any or all of the operations of blocks 114, 116, 118, 120, 122, and 124 may be performed in accordance with a determination that the user intent represents an intent to provide a summary of text in a physical environment.
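Tying the blocks together, a hedged sketch of how the gated request to the trained model might be assembled is shown below; the dictionary keys, the intent label, and the `send_to_model` placeholder are assumptions for illustration only.

```python
# Illustrative sketch: if the determined intent is to summarize text, assemble
# the information regarding the text (block 106), contextual information
# (block 116), response length parameters (block 118), and any user question
# (block 120) into a single request for the trained model.

def handle_intent(intent, text_payload, context_payload=None,
                  length_params=None, user_question=None, send_to_model=print):
    if intent != "summarize_text":
        return None                               # blocks 106-124 are skipped
    request = {"text_info": text_payload}         # block 106 (OCR text or images)
    if context_payload:
        request["context"] = context_payload      # block 116
    if length_params:
        request["length"] = length_params         # block 118
    if user_question:
        request["question"] = user_question       # block 120
    return send_to_model(request)                 # summary is then presented (blocks 122-124)

handle_intent(
    "summarize_text",
    text_payload={"text": "Recognized paragraph text"},
    context_payload={"reading_level": "eighth grade"},
    length_params={"absolute_min": 50, "absolute_max": 150},
    user_question="What is the main argument of this paragraph?",
)
```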
In general, the trained model may be stored in control circuitry 14 or stored in one or more external servers (e.g., external server(s) 60 from
Consider an example where a user is reading a book that is viewable through the see-through display(s) 32 of head-mounted device 10. During the operations of block 102, the user may point to a paragraph in the book with their finger, and this gesture is detected using camera(s) 36.
During the operations of block 104, control circuitry 14 may determine that the intent of the finger point is for the head-mounted device to provide a summary of the paragraph being pointed to.
During the operations of blocks 106, 108, and 112, in accordance with the determination that the user intent represents an intent to provide a summary of text in the physical environment, head-mounted device 10 may select a subset of the text in the physical environment (e.g., the paragraph being pointed to), perform optical character recognition on the subset of the text, and wirelessly provide a machine-readable version of the subset of the text to a trained model (e.g., a large language model).
During the operations of block 114, control circuitry 14 may present a visual indicator (e.g., a bracket or reticle) on display(s) 32 identifying the paragraph that is intended to be summarized.
During the operations of block 116, control circuitry 14 wirelessly provides contextual information to the trained model. In this example, the contextual information includes the user's current location (as determined using a GPS sensor in head-mounted device 10) and educational information for the user.
During the operations of block 118, control circuitry 14 wirelessly provides one or more response length parameters to the trained model. In this example, the response length parameters include an absolute minimum of 50 words and an absolute maximum of 150 words. These parameters ensure that the text summary received from the trained model will be between 50 words and 150 words.
During the operations of block 120, control circuitry 14 wirelessly provides a user question to the trained model. In this example, the user question (“what is the main argument of this paragraph?”) is provided audibly to a microphone in the head-mounted device and then communicated to the trained model by communication circuitry 56.
After the operations of blocks 116, 118, and 120, head-mounted device 10 may wirelessly receive a text summary from the trained model. The text summary may be based on the information regarding the text from block 106, the contextual information from block 116, the response length parameter from block 118, and the user question from block 120.
After receiving the text summary, head-mounted device 10 presents a visible notification during the operations of block 122. The visible notification identifies that the text summary is ready for viewing.
Finally, during the operations of block 124, the text summary from the trained model is presented on display(s) 32. In this example, the text summary is presented at a position that does not overlap the text in the physical environment (e.g., as in
It is noted that, although the techniques herein have been described in connection with text, the techniques may also be applicable to pictures, graphs, etc. For example, a user may point to a picture in their physical environment. The head-mounted device may send information regarding the picture (e.g., an image of the picture captured by camera 36) instead of information regarding text to a trained model and may receive a text summary of the picture from the trained model that is based on the information regarding the picture, contextual information, response length parameters, and/or user questions.
As another example, a user may point to a graph in their physical environment. The head-mounted device may send information regarding the graph (e.g., an image of the graph captured by camera 36) instead of information regarding text to a trained model and may receive a text summary of the graph from the trained model that is based on the information regarding the graph, contextual information, response length parameters, and/or user questions.
Additionally, the summary from the trained model may optionally include a graph or chart instead of or in addition to text. A user may point to text in their physical environment. The text may describe a trend. Instead of or in addition to a text summary, the head-mounted device may receive a graph from trained model 62 that summarizes the trend described in the text.
The techniques described above may also be used in a head-mounted device with an opaque display that presents a passthrough video feed of the user's physical environment.
The techniques described herein may be applicable to audio feedback in addition to visual text feedback. As an example, an electronic device such as a head-mounted device, cellular telephone, tablet computer, laptop computer, etc. may have a digital assistant that is capable of answering questions and/or requests (e.g., as detected using a microphone in head-mounted device 10) from the user. The digital assistant may provide audio feedback (e.g., using speaker 34) to the user that is determined using a trained model (similar to as in block 124 of
The example in
As described above, one aspect of the present technology is the gathering and use of information such as sensor information. The present disclosure contemplates that in some instances, data may be gathered that includes personal information data that uniquely identifies or can be used to contact or locate a specific person. Such personal information data can include demographic data, location-based data, telephone numbers, email addresses, twitter ID's, home addresses, data or records relating to a user's health or level of fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, username, password, biometric information, or any other identifying or personal information.
The present disclosure recognizes that the use of such personal information, in the present technology, can be used to the benefit of users. For example, the personal information data can be used to deliver targeted content that is of greater interest to the user. Accordingly, use of such personal information data enables users to have control of the delivered content. Further, other uses for personal information data that benefit the user are also contemplated by the present disclosure. For instance, health and fitness data may be used to provide insights into a user's general wellness, or may be used as positive feedback to individuals using technology to pursue wellness goals.
The present disclosure contemplates that the entities responsible for the collection, analysis, disclosure, transfer, storage, or other use of such personal information data will comply with well-established privacy policies and/or privacy practices. In particular, such entities should implement and consistently use privacy policies and practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining personal information data private and secure. Such policies should be easily accessible by users, and should be updated as the collection and/or use of data changes. Personal information from users should be collected for legitimate and reasonable uses of the entity and not shared or sold outside of those legitimate uses. Further, such collection/sharing should occur after receiving the informed consent of the users. Additionally, such entities should consider taking any needed steps for safeguarding and securing access to such personal information data and ensuring that others with access to the personal information data adhere to their privacy policies and procedures. Further, such entities can subject themselves to evaluation by third parties to certify their adherence to widely accepted privacy policies and practices. In addition, policies and practices should be adapted for the particular types of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations. For instance, in the United States, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA), whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly. Hence different privacy practices should be maintained for different personal data types in each country.
Despite the foregoing, the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data. For example, the present technology can be configured to allow users to select to “opt in” or “opt out” of participation in the collection of personal information data during registration for services or anytime thereafter. In another example, users can select not to provide certain types of user data. In yet another example, users can select to limit the length of time user-specific data is maintained. In addition to providing “opt in” and “opt out” options, the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an application (“app”) that their personal information data will be accessed and then reminded again just before personal information data is accessed by the app.
Moreover, it is the intent of the present disclosure that personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed. In addition, and when applicable, including in certain health related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing specific identifiers (e.g., date of birth, etc.), controlling the amount or specificity of data stored (e.g., collecting location data at a city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods.
Therefore, although the present disclosure broadly covers use of information that may include personal information data to implement one or more various disclosed embodiments, the present disclosure also contemplates that the various embodiments can also be implemented without the need for accessing personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data.
The foregoing is merely illustrative and various modifications can be made to the described embodiments. The foregoing embodiments may be implemented individually or in any combination.
This application claims the benefit of U.S. provisional patent application No. 63/586,281 filed Sep. 28, 2023, U.S. provisional patent application No. 63/598,688 filed Nov. 14, 2023, and U.S. provisional patent application No. 63/550,969 filed Feb. 7, 2024, which are hereby incorporated by reference herein in their entireties.
Number | Date | Country
--- | --- | ---
63/586,281 | Sep. 28, 2023 | US
63/598,688 | Nov. 14, 2023 | US
63/550,969 | Feb. 7, 2024 | US