Embodiments of the present invention generally relate to augmented reality-assisted collaboration, and more specifically to AI-driven augmented reality-assisted mentoring and collaboration.
Augmented reality (AR) is a real-time view of a physical, real-world environment whose elements are “augmented” by computer-generated sensory input such as sound, video, graphics and positioning data. A display of a real-world environment is enhanced by augmented data pertinent to the use of an augmented reality device. For example, mobile devices provide augmented reality applications that allow users to view their surrounding environment through the camera of the mobile device, while the mobile device determines its location based on global positioning satellite (GPS) data, triangulation of the device location, or other positioning methods. These devices then overlay the camera view of the surrounding environment with location-based data such as local shops, restaurants and movie theaters, as well as the distance to landmarks, cities and the like.
Current methods for AR systems require a great deal of manual effort. Most existing solutions cannot be used in new/unknown environments. These methods need to scan the environment with sensors (such as cameras) to capture objects of a scene that must be identified by an expert so that a feature database can be constructed. In such current systems, a database needs to be manually annotated with information about the locations of different components of the environment. An AR provider then manually specifies a coordinate according to the landmark database of the target environment for inserting a virtual object/character inside the real scene for AR visualization. Such manual authoring can take several hours depending on the number of objects/components in a scene. Further, in current systems an annotated database is only valid for a specific make and model.
Embodiments of the present principles provide methods, apparatuses and systems for AI-driven augmented reality-assisted mentoring and collaboration.
In some embodiments, a method for AI-driven augmented reality mentoring and collaboration includes determining semantic features of objects in at least one captured scene using a deep learning algorithm to identify the objects in the at least one scene, determining 3D positional information of the objects in the at least one captured scene, combining information regarding the identified objects of the at least one captured scene with respective 3D positional information to determine at least one intermediate representation of the at least one scene, which provides information regarding positions of the identified objects in the at least one captured scene and spatial relationships among the identified objects, completing the determined at least one intermediate representation using machine learning to include additional objects or positional information of the objects not identifiable from the at least one captured scene, determining at least one task to be performed and determining steps to be performed related to the identified at least one task using a knowledge database comprising data relating to respective steps to be performed for different tasks, generating at least one visual representation relating to the determined steps to be performed for the at least one task to assist the at least one user in performing the at least one task, determining a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the at least one user using information in the at least one completed intermediate representation, and displaying the at least one visual representation on the see-through display in the determined correct position as an augmented overlay to the view of the at least one user to guide the at least one user to perform the at least one task.
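By way of a non-limiting illustration only, the following sketch outlines how the stages recited above could be organized in software. Every function name, label and numeric value in the sketch is a hypothetical stand-in (stubbed with dummy data so the flow can be read end to end) and does not represent the claimed method or any particular implementation.

```python
# Illustrative sketch only: hypothetical stand-ins for the stages recited above.
def segment_objects(frame):                  # deep-learning identification of objects (stub)
    return [{"id": 0, "label": "oil_cap"}]

def estimate_3d_positions(frame, objects):   # depth sensor / image-based depth (stub)
    return {0: (0.4, 0.1, 0.8)}              # metres in the camera frame

def build_scene_graph(objects, positions):   # intermediate representation of the scene
    nodes = {o["id"]: {"label": o["label"], "xyz": positions[o["id"]]} for o in objects}
    return {"nodes": nodes, "edges": []}

def complete_scene_graph(graph):             # ML-based completion (stub): add unseen objects
    graph["nodes"][1] = {"label": "engine_block", "xyz": (0.4, -0.2, 0.9)}
    graph["edges"].append((0, "on", 1))
    return graph

def lookup_task_steps(task, knowledge_db):   # knowledge-database lookup (stub)
    return ["locate oil cap", "remove oil cap"]

def place_overlay(step, graph):              # correct display position from the completed graph
    anchor = next(n for n in graph["nodes"].values() if n["label"] == "oil_cap")
    return {"text": step, "xyz": anchor["xyz"]}

frame, knowledge_db = None, None             # stand-ins for a captured frame and database
objects = segment_objects(frame)
graph = complete_scene_graph(build_scene_graph(objects, estimate_3d_positions(frame, objects)))
steps = lookup_task_steps("oil change", knowledge_db)
overlay = place_overlay(steps[0], graph)     # rendered on the see-through display
print(overlay)
```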
In some embodiments, the at least one user comprises two or more users and received and determined information is shared among the two or more users such that a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the two or more users is determined using information in at least one completed scene graph related to either one of the two or more users.
In some embodiments, 3D positional information of the objects is determined using at least one of data received from a sensor capable of capturing depth information of a scene, image-based methods such as monocular image-based depth estimation or multi-frame structure from motion imagery, or 3D sensors.
In some embodiments, determining a correct position for displaying the at least one visual representation further includes determining an intermediate representation for the generated at least one visual representation which provides information regarding positions of objects in the at least one visual representation and spatial relationships among the objects, and comparing the determined intermediate representation of the generated at least one visual representation with the at least one intermediate representation of the at least one scene to determine how closely the objects of the visual representation align with the objects of the at least one scene.
In some embodiments, a task to be performed can be determined by generating a scene understanding of the at least one captured scene based on an automated analysis of the at least one captured scene, wherein the at least one captured scene comprises a view of a user during performance of a task related to the identified at least one object in the at least one captured scene.
In some embodiments, the intermediate representation comprises a scene graph.
In some embodiments, the method can further include analyzing actions of the user during the performance of a step of the task by using information related to a next step of the task, wherein, if the user has not completed the next step of the task, new visual representations are created to be generated and presented as an augmented overlay to guide the user to complete the performance of the next step of the task, and if the user has completed the next step of the task and a subsequent step of the task exists, new visual representations are created to be generated and presented as an overlay to guide the user to complete the performance of the subsequent step of the task.
In some embodiments, the at least one captured scene includes both video data and audio data, the video data comprising a view of the user of a real-world scene during performance of a task and the audio data comprising speech of the user during performance of the task, and wherein the steps relating to the performance of the task are further determined using at least one of the video data or the audio data.
In some embodiments, an apparatus for AI-driven augmented reality mentoring and collaboration includes a processor, and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions. In some embodiments, upon execution of the at least one of programs or instructions by the processor, the apparatus is configured to determine semantic features of objects in at least one captured scene using a deep learning algorithm to identify the objects in the at least one scene, determine 3D positional information of the objects in the at least one captured scene, combine information regarding the identified objects of the at least one captured scene with respective 3D positional information to determine at least one intermediate representation of the at least one scene, which provides information regarding positions of the identified objects in the at least one captured scene and spatial relationships among the identified objects, complete the determined at least one intermediate representation using machine learning to include additional objects or positional information of the objects not identifiable from the at least one captured scene, determine at least one task to be performed and determine steps to be performed related to the identified at least one task using a knowledge database comprising data relating to respective steps to be performed for different tasks, generate at least one visual representation relating to the determined steps to be performed for the at least one task to assist the at least one user in performing the at least one task, determine a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the at least one user using information in the at least one completed intermediate representation, and display the at least one visual representation on the see-through display in the determined correct position as an augmented overlay to the view of the at least one user to guide the at least one user to perform the at least one task.
In some embodiments a non-transitory computer readable storage medium has stored thereon instructions that when executed by a processor perform a method for AI-driven augmented reality mentoring and collaboration, the method including determining semantic features of objects in at least one captured scene using a deep learning algorithm to identify the objects in the at least one scene, determining 3D positional information of the objects in the at least one captured scene, combining information regarding the identified objects of the at least one captured scene with respective 3D positional information to determine at least one intermediate representation of the at least one scene, which provides information regarding positions of the identified objects in the at least one captured scene and spatial relationships among the identified objects, completing the determined at least one intermediate representation using machine learning to include additional objects or positional information of the objects not identifiable from the at least one captured scene, determining at least one task to be performed and determining steps to be performed related to the identified at least one task using a knowledge database comprising data relating to respective steps to be performed for different tasks, generating at least one visual representation relating to the determined steps to be performed for the at least one task to assist the at least one user in performing the at least one task, determining a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the at least one user using information in the at least one completed intermediate representation, and displaying the at least one visual representation on the see-through display in the determined correct position as an augmented overlay to the view of the at least one user to guide the at least one user to perform the at least one task.
In some embodiments a method for AI-driven augmented reality mentoring and collaboration for two or more users includes determining semantic features of objects in at least one captured scene associated with two or more users using a deep learning algorithm to identify the objects in the at least one captured scene, determining 3D positional information of the objects in the at least one captured scene, combining information regarding the identified objects of the at least one captured scene with respective 3D positional information of the objects to determine at least one intermediate representation of the at least one scene, which provides information regarding positions of the identified objects in the at least one captured scene and spatial relationships among the identified objects, completing the determined at least one intermediate representation using machine learning to include at least additional objects or additional positional information of the objects not identifiable from the at least one captured scene, determining at least one task to be performed and determining steps to be performed related to the identified at least one task using a knowledge database comprising data relating to respective steps to be performed for different tasks, generating at least one visual representation relating to the determined steps to be performed for the at least one task to assist the at least one user in performing the at least one task, determining a correct position for displaying the at least one visual representation on a respective see-through display of the two or more users as an augmented overlay to the view of the two or more users using information in the at least one completed intermediate representation, and displaying the at least one visual representation on the respective see-through displays in the determined correct position as an augmented overlay to the view of the two or more users to guide the two or more users to perform the at least one task, individually or in tandem.
Various advantages, aspects and features of the present disclosure, as well as details of an illustrated embodiment thereof, are more fully understood from the following description and drawings.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. The figures are not drawn to scale and may be simplified for clarity. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
Embodiments of the present principles generally relate to artificial intelligence (AI)-driven augmented reality (AR) mentoring and collaboration. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to specific collaborations, such as the maintenance and repair of automobiles, embodiments of the present principles can be applied to the mentoring of and collaboration with users via an AI-driven, AR-assisted method, system and apparatus for the completion of substantially any task.
Embodiments of the present principles can automatically insert visual objects/characters in a scene based on scene contexts with known relationships among objects in, for example, a new/unknown environment. In some embodiments, the process becomes automatic based on prior knowledge/rules/instructions, such as inserting humans on real chairs. In some embodiments, the scene contexts can be automatically analyzed based on semantic scene graph technologies. No manual effort to build a database is needed beforehand.
The guidance/instructions/training provided by embodiments of the present principles can be related to any kind of task, for instance, using a device, repairing a machine, or performing a certain task in a given environment. For example, embodiments of the present principles can provide operational guidance and training for augmented reality collaboration, including the maintenance and repair of devices and systems such as automobiles, to a user via two modes of support: (1) Visual Support: Augmented Reality (AR)-assisted graphical instructions on a user's display, and (2) Audio Support: Virtual Personal Assistant (VPA) verbal instructions. In some embodiments of the present principles, a machine learning (ML) network (e.g., a CNN-based ML network) is implemented for semantic segmentation of objects in a captured scene and for learning a scene graph representation of components of the scene to guide the learning process towards accurate relative localization of semantic objects/components. Alternatively or in addition, in some embodiments of the present principles, efficient training of scene models can also be accomplished using determined scene graphs.
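As a purely illustrative example of CNN-based semantic segmentation (and not the specific ML network of the present principles), a publicly available pretrained segmentation model can produce per-pixel class labels from which per-object masks are derived. The model choice, input file name, and the "DEFAULT" weights argument below are assumptions based on recent torchvision releases.

```python
# Illustrative only: a pretrained DeepLabV3 model standing in for the CNN-based
# semantic segmentation of a captured scene described above.
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()   # weights API of recent torchvision
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("captured_scene.jpg").convert("RGB")     # hypothetical captured frame
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"]   # (1, num_classes, H, W)
class_map = logits.argmax(dim=1)[0]                         # per-pixel class labels

# Binary masks per detected class become the "semantic features" handed to the
# scene graph stage together with 3D positional information.
masks = {int(c): (class_map == c) for c in class_map.unique()}
```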
In accordance with the present principles, each of the objects of a scene captured by at least one sensor can be identified by a scene module of the present principles (described below). Advantageously, an AI-driven AR mentoring and collaboration system of the present principles, such as the AI-driven AR mentoring and collaboration system 100 of
In the embodiment of the AI-driven AR mentoring and collaboration system 100 of
In the embodiment of the AI-driven AR mentoring and collaboration system 100 of
In some embodiments of the present principles, a machine learning system of the present principles, such as the machine learning system 141 of the scene module 101 of the AI-driven AR mentoring and collaboration system 100 of
The machine learning system 141 of the scene module 101 of the AI-driven AR mentoring and collaboration system 100 of
In accordance with the present principles, depth information of a captured scene is determined. In some embodiments of the AI-driven AR mentoring and collaboration system of the present principles, such as the AI-driven AR mentoring and collaboration system 100 of
Alternatively or in addition, in some embodiments of the present principles, depth information of a scene (e.g., scene 153) can be determined using image-based methods, such as monocular depth estimation of images and multi-frame structure from motion imagery, and/or other 3D sensors. The depth information can be used by the scene module 101 to determine a point cloud of a captured scene.
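For illustration, and assuming a simple pinhole camera model with hypothetical intrinsic parameters, a depth map obtained from a depth sensor or monocular estimation can be back-projected into a camera-frame point cloud as sketched below.

```python
# Back-project a depth map into a 3D point cloud (camera frame) with a pinhole model.
# The intrinsics (fx, fy, cx, cy) and the constant depth map are hypothetical values.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """depth: (H, W) array of metric depth values; returns (H*W, 3) XYZ points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((480, 640), 1.5)                    # stand-in for sensor/monocular depth
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
# A per-object 3D position can then be taken as, e.g., the centroid of the points
# falling inside that object's segmentation mask.
```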
In accordance with the present principles, the scene graph module 142 can use the semantic features determined by the semantic segmentation module 140 and the depth information, such as 3D object position information of for example a determined point cloud, to determine a scene graph representation of captured scenes, which include node and edge level features of the objects in the scene. That is, in some embodiments the scene graph module 142 can determine a scene graph of each captured scene.
Scene graph representations serve as a powerful way of representing image content in a simple graph structure. A scene graph consists of a heterogeneous graph in which the nodes represent objects or regions in an image and the edges represent relationships between those objects. A determined scene graph of the present principles provides both the dimensions of each object and the location of the object, including its spatial relationships to the other identified objects represented in the scene graph. More specifically, in embodiments of the present principles, the scene graph module 142 of the scene module 101 combines information regarding the determined semantic features of the scene with respective 3D positional information (e.g., point cloud(s)), for example, using a machine learning system (e.g., in some embodiments, neural networks) to determine a representation of the scene which provides information regarding positions of the identified objects in the scene and spatial relationships among the identified objects. An embodiment of such a process is described in commonly owned, pending U.S. patent application Ser. No. 17/554,661, filed Dec. 17, 2021, which is herein incorporated by reference in its entirety. For example, in some embodiments the machine learning system 141 of the scene module 101 can be implemented to determine a representation of the scene which provides information regarding positions of the identified objects in the scene and spatial relationships among the identified objects as described in pending U.S. patent application Ser. No. 17/554,661.
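As a minimal, illustrative sketch (not the representation of the referenced application), a scene graph can be held as a heterogeneous graph whose nodes carry object labels, optional dimensions and 3D positions, and whose edges carry spatial relationships; all labels, coordinates and relations below are hypothetical.

```python
# Minimal scene graph sketch: nodes are identified objects with 3D positions,
# edges are spatial relationships between them. All values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str
    xyz: tuple            # 3D position (metres, camera or world frame)
    size: tuple = None    # optional bounding-box dimensions

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)    # node_id -> ObjectNode
    edges: list = field(default_factory=list)    # (subject_id, relation, object_id)

    def add_relation(self, subj, relation, obj):
        self.edges.append((subj, relation, obj))

g = SceneGraph()
g.nodes["table"] = ObjectNode("table", (0.0, 0.0, 1.2), (1.2, 0.8, 0.7))
g.nodes["chair"] = ObjectNode("chair", (-0.6, 0.0, 1.4))
g.nodes["tv"] = ObjectNode("tv", (0.0, 0.9, 2.0))
g.add_relation("chair", "left_of", "table")
g.add_relation("tv", "behind", "table")
```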
For example,
As further depicted in the embodiment of
In some embodiments of the present principles, the identity and location of all of the objects/components identified by the scene module 101 can be used to identify a device or system (e.g., an automobile) that the objects/components make up and/or can be used to determine how to perform actions/repairs on the identified objects/components and/or an identified device. For example, in some embodiments the identity and location of all of the identified objects/components can be compared with components and locations of previously known devices and systems, which in some embodiments can be stored in a storage device, such as the knowledge database 108 of
In some embodiments of the present principles, if the type and location of all of the objects/components identified by the scene module 101 do not align with the type and location of the objects/components of any known device and/or system stored in the knowledge database 108, an AR mentoring and collaboration system of the present principles, such as the AR mentoring and collaboration system 100 of
For example, an AR mentoring and collaboration system of the present principles, such as the AR mentoring and collaboration system 100 of
In some embodiments, a task to be performed can be determined from the user request or any other language captured from the scene 153. For example, in some embodiments, the language module 104 of the understanding module 123 of the AR mentoring and collaboration system 100 of
Alternatively or in addition, in some embodiments of the present principles, a task to be performed can be determined automatically using captured scene data, which is described in greater detail below with respect to a description of the determination of user intent in accordance with an embodiment of the present principles.
Once a task to be performed has been determined/identified in accordance with the various embodiments of the present principles described herein, a database can be searched to determine if instructions/steps exist for performing the identified task. For example, in some embodiments, the task mission understanding module 106 of the understanding module 123 of the AR mentoring and collaboration system 100 of
In some embodiments of the present principles, the reasoning module 110 receives relative task information, determines which task or step(s) of a task has priority in completion, and reasons a next step based on the priority. The output from the reasoning module 110 can be input to the augmented reality generator 112 and, in some embodiments (further described below), the speech generator 114. The AR generator 112 creates display content that takes into account at least one scene graph determined for a captured scene (e.g., the scene 153) and/or a global scene graph and, in some embodiments, a user perspective (as described below). The AR generator 112 can update the display the user sees in real time as the user performs tasks, completes tasks and steps, moves on to different tasks, and transitions from one environment to the next.
In accordance with the present principles, the AR mentoring and collaboration system of the present principles can assist the user in, for example, cooking the specific dish in the identified environment. For example, the AR mentoring and collaboration system of the present principles can generate an AR image of the ingredients of the specific dish on the identified countertop. The AR mentoring and collaboration system of the present principles can then generate an AR image of a specific ingredient on a cutting board on the countertop and also generate an AR image of a knife chopping the ingredient if required. The AR mentoring and collaboration system of the present principles can further generate an image of a boiling pot of water on the identified stove and an AR image of a placement of at least one of the ingredients into the boiling pot of water, and so on. That is, the AR mentoring and collaboration system of the present principles can generate AR images to instruct the user on how to cook the specific dish according to the retrieved task steps/instructions.
In some embodiments of the present principles, an AR mentoring and collaboration system of the present principles, such as the AR mentoring and collaboration system 100 of
For example,
For further explanation,
As depicted in the embodiment of
Similarly, in the embodiment of
In summary, scene graphs determined for captured scenes in accordance with the present principles can be completed, as described above, using a machine learning process of the machine learning system 141 to include additional objects or positional information of the objects not identifiable from the at least one captured scene. Information from scene graphs completed in accordance with the present principles can be used to determine a correct position for displaying determined AR content on a see-through display as an augmented overlay to the view of at least one user. That is, if the original scene graph determined for captured scene content does not include an object near/on which AR content is to be displayed, completing the scene graph in accordance with the present principles can provide additional object and/or positioning information not determinable/identifiable from the original scene graph, enabling a more accurate determination of a correct position for displaying the determined AR content on a see-through display as an augmented overlay to the view of at least one user. In some embodiments of the present principles, a correct position for displaying determined AR content can be determined by generating a scene graph for the AR content, which provides information regarding positions of objects in the AR content and spatial relationships among the objects, and comparing the determined scene graph of the AR content with the scene graph of a captured scene to determine how closely the objects of the visual representation align with the objects of the at least one scene.
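The following sketch illustrates only the idea of scene graph completion; the machine learning completion described above is replaced here by simple, hypothetical co-occurrence priors so the flow of adding inferred objects and relations can be read directly. All labels, offsets and positions are assumptions.

```python
# Illustrative scene graph "completion": add objects and relations that are expected
# from prior knowledge but were not identified in the captured frame. Hypothetical
# co-occurrence priors stand in for the learned completion model described above.
COMPLETION_PRIORS = {
    # observed label -> (expected companion, relation, typical offset in metres)
    "steering_wheel": ("dashboard", "attached_to", (0.0, -0.1, 0.1)),
    "chair":          ("table",     "next_to",     (0.6, 0.0, 0.0)),
}

def complete_scene_graph(graph):
    """graph: {'nodes': {id: {'label', 'xyz'}}, 'edges': [(id, relation, id)]}"""
    present = {n["label"] for n in graph["nodes"].values()}
    for node_id, node in list(graph["nodes"].items()):
        prior = COMPLETION_PRIORS.get(node["label"])
        if prior and prior[0] not in present:
            companion, relation, offset = prior
            new_id = f"{companion}_inferred"
            graph["nodes"][new_id] = {
                "label": companion,
                "xyz": tuple(a + b for a, b in zip(node["xyz"], offset)),
                "inferred": True,   # flag: not directly observed in the scene
            }
            graph["edges"].append((node_id, relation, new_id))
    return graph

scene = {"nodes": {"n0": {"label": "chair", "xyz": (0.0, 0.0, 1.5)}}, "edges": []}
print(complete_scene_graph(scene)["nodes"].keys())   # adds an inferred 'table' node
```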
In some embodiments of the present principles, an AR mentoring and collaboration system of the present principles, such as the AR mentoring and collaboration system 100 of
In accordance with the present principles and as described above, once a task is determined, steps for performing the task can be identified using information stored in the database 180. As further described above, the reasoning module 110 can receive relative task information, determine which task or step(s) of a task has priority in completion, and reason a next step based on the priority. The output from the reasoning module 110 can be input to the augmented reality generator 112 and, in some embodiments (further described below), the speech generator 114. The AR generator 112 creates AR display content that takes into account at least one scene graph determined for a captured scene (e.g., the scene 153) and/or a global scene graph and, in some embodiments, a user perspective (as described below). The AR generator 112 can update the display the user sees in real time as the user performs tasks, completes tasks and steps, moves on to different tasks, and transitions from one environment to the next.
In some embodiments of the present principles, the AR generator 112 can implement a matching process to determine/verify a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of two or more users. That is, in some embodiments, a scene graph can be determined for AR content generated by the AR generator 112 in accordance with the present principles. In such embodiments, the AR scene graph can be determined and compared by the AR generator 112 (or alternatively by the scene module 101) to a scene graph determined for a captured scene, such as the scene 153. The determined AR content can then be adjusted based on how closely the AR scene graph matches the scene graph determined for a captured scene in which the AR content is to be displayed. For example, if determined AR content includes a man to be inserted on a chair near a tv and a table, and a remote to be placed on the table, an AR scene graph of the man and the remote can be compared to a scene graph of the chair, the tv, and the table to determine how closely the AR man and remote align with the scene graph of the chair, the tv, and the table. In some embodiments, a similarity score can be determined from the comparison, and a threshold can be established such that AR content having a similarity score over the threshold is displayed on the scene and AR content having a similarity score below the threshold is not displayed on the scene (or vice versa).
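A minimal sketch of such a matching and threshold check is shown below, assuming a simplified representation in which each piece of AR content lists the scene objects it must anchor to; the anchor relations, labels and threshold value are hypothetical.

```python
# Illustrative placement check: score how many of the AR content's required anchor
# objects are present in the captured scene's scene graph, and only display the AR
# content if the score clears a (hypothetical) similarity threshold.
def placement_score(ar_requirements, scene_nodes):
    """ar_requirements: [(ar_object, relation, scene_anchor_label), ...]
       scene_nodes: set of object labels present in the captured scene graph."""
    if not ar_requirements:
        return 0.0
    satisfied = sum(1 for _, _, anchor in ar_requirements if anchor in scene_nodes)
    return satisfied / len(ar_requirements)

ar_requirements = [("man", "sits_on", "chair"), ("remote", "on", "table")]
scene_nodes = {"chair", "tv", "table"}
SIMILARITY_THRESHOLD = 0.8                    # hypothetical threshold
score = placement_score(ar_requirements, scene_nodes)
if score >= SIMILARITY_THRESHOLD:
    pass   # render the AR content anchored to the matched scene objects
```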
Referring back to
The language module 104 performs natural language processing on the received audio feed, augmenting the scene understanding generated by the scene module 101. The language module 104 can include a real-time dialog and reasoning system that supports human-like interaction using spoken natural language. The language module 104 can be based on automated speech recognition, natural language understanding, and reasoning. The language module 104 recognizes the user's goals and provides feedback through the speech generator 114, discussed below. The feedback and interaction with a user can occur both audibly and by causing the AI-driven AR mentoring and collaboration system 100 to display icons and text visually on a user's display.
The function of the understanding module 123 of
The correlation module 102 correlates the scene and language data (if any exists), stores the scene and language data in the database 108, and correlates the data into a user state 105, which according to some embodiments comprises a model of user intent.
In some embodiments, the task mission understanding module 106 receives the user state 105 as input and generates a task understanding 107. The task understanding 107 is a representation of a set of goals 109 that the user is trying to achieve, based on the user state 105 and the scene understanding in the scene and language data. A plurality of task understandings can be generated by the task mission understanding module 106, where the plurality of task understandings forms a workflow ontology. The goals 109 are a plurality of goals, which can include a hierarchy of goals, or a task ontology, that must be completed for a task understanding to be considered complete. Each goal can have parent-goals, sub-goals, and so forth. According to some embodiments, there are pre-stored task understandings that a user can implement, such as “perform oil change”, “check fluids” or the like, for which a task understanding does not have to be generated, only retrieved.
The task understanding 107 is coupled to the reasoning module 110 as an input. The reasoning module 110 processes the task understanding 107, along with task ontologies and workflow models from the database 108, and reasons about the next step in an interactive dialog that the AI-driven AR mentoring and collaboration system 100 needs to interact with the user to achieve the goals 109 of the task understanding 107. According to some embodiments, hierarchical action models are used to define tasking cues relative to the workflow ontologies that are defined. In some embodiments of the present principles, the reasoning module 110 determines which goal or sub-goal has priority in completion and reasons a next step based on the priority.
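For illustration only, the sketch below shows one simple way such prioritized next-step reasoning could be expressed, assuming goals with hypothetical names, priorities and parent dependencies; it is not the reasoning module's actual implementation.

```python
# Illustrative prioritized next-step selection: pick the highest-priority incomplete
# goal whose parent goal has already been completed. Goal data is hypothetical.
from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    priority: int            # lower value = higher priority
    done: bool = False
    parent: str = None       # parent goal that must be completed first

def next_step(goals):
    done = {g.name for g in goals if g.done}
    ready = [g for g in goals if not g.done and (g.parent is None or g.parent in done)]
    return min(ready, key=lambda g: g.priority, default=None)

task = [
    Goal("drain old oil", priority=1, done=True),
    Goal("replace oil filter", priority=2, parent="drain old oil"),
    Goal("add new oil", priority=3, parent="replace oil filter"),
]
print(next_step(task).name)   # -> "replace oil filter"
```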
The output from the reasoning module 110 is input to the augmented reality generator 112 and the speech generator 114. The AR generator 112 creates display content that takes the world model and user perspective from the at least one sensor 103 into account (i.e., task ontologies, next steps, display instructions, apparatus overlays, and the like are modeled over the three-dimensional model of a scene stored in database 108 according to the user's perspective, as described in commonly owned U.S. Pat. No. 9,213,558, issued Dec. 15, 2015, and U.S. Pat. No. 9,082,402, issued Jul. 14, 2015, both of which are incorporated by reference in their entirety herein). The AR generator 112 can update the display the user sees in real time as the user performs tasks, completes tasks and goals, moves on to different tasks, and transitions from one environment to the next.
In the AI-driven AR mentoring and collaboration system 100 of
In some embodiments, the optional performance module 120 actively analyzes the user's performance in following task ontologies, completing workflows, goals, and the like. The performance module 120 can then also output display updates and audio updates to the AR generator 112 and the speech generator 114. The performance module 120 can also interpret user actions against the task the user is attempting to accomplish. This, in turn, feeds the reasoning module 110 on next actions or verbal cues to present to the user.
In the embodiment of
In some embodiments, the recognition module 406 uses the information generated by the localization module 408 to generate a model for user gaze 436 as well as the objects 430 and the tools 432 within the user's field of regard.
In the embodiment of the understanding module 123 of
In the embodiment of
In the embodiment of the understanding module 123 of
In the embodiment of the understanding module 123 of
In the embodiment of the understanding module 123 of
In the understanding module 123 of the AI-driven AR mentoring and collaboration system 100 of
For example, a workflow 500 of the TMUM 106 in accordance with one embodiment of the present principles is shown in
Referring back to
For example, a user might look at a particular object and say “where do I put this?” The scene module 101 identifies the locations of objects in the scene and the direction in which the user is looking (e.g., at a screwdriver), and the language module 104 identifies that the user is asking a question to locate the new position of an object, but neither component has a complete understanding of the user's real goal. By merging information generated by the individual modules, an AI-driven AR mentoring and collaboration system of the present principles will determine that the user is “asking a question to locate the new position of a specific screwdriver”. Furthermore, it is often not enough to understand only what the user said in the last utterance; it is also important to interpret that utterance in the context of recent speech and scene data. In the running example, depending on the task the user is trying to complete, the question in the utterance might be referring to a “location for storing the screwdriver” or a “location for inserting the screwdriver into another object.”
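As a simplified, hypothetical illustration of this kind of multimodal fusion, the sketch below resolves the deictic reference “this” to the identified object lying closest to the user's gaze ray and merges the result with the language intent; the object layout, gaze values and intent labels are assumptions.

```python
# Illustrative fusion of language and scene understanding: resolve "this" in
# "where do I put this?" to the object nearest the user's gaze ray, then form a
# combined intent. Object positions and gaze direction are hypothetical values.
import numpy as np

def resolve_deictic_reference(gaze_origin, gaze_dir, objects):
    """objects: {label: xyz}; returns the label whose position lies closest to the gaze ray."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    def ray_distance(p):
        v = np.asarray(p) - gaze_origin
        return np.linalg.norm(v - np.dot(v, gaze_dir) * gaze_dir)
    return min(objects, key=lambda label: ray_distance(objects[label]))

objects = {"screwdriver": (0.3, -0.1, 0.6), "wrench": (-0.4, 0.0, 0.7)}
gaze_origin = np.zeros(3)
gaze_dir = np.array([0.45, -0.15, 1.0])
referent = resolve_deictic_reference(gaze_origin, gaze_dir, objects)
intent = {"act": "ask_location", "object": referent}   # "where do I put the screwdriver?"
```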
The TMUM 106 of the understanding module 123 of
In some embodiments of the present principles, Hidden Markov Models (HMM) can be used to model the transitions of the finite-state machine that represents the task workflow 500 of
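By way of a simplified illustration of such HMM-based workflow tracking (not the actual task workflow 500), the sketch below performs a forward belief update over hidden workflow steps given noisy observations of user actions; the states, transition matrix and emission likelihoods are hypothetical example values.

```python
# Illustrative HMM-style tracking of workflow state: a forward update over hidden
# workflow steps given noisy observations of user actions. All values hypothetical.
import numpy as np

states = ["locate_part", "remove_part", "install_part"]
T = np.array([[0.7, 0.3, 0.0],      # transition probabilities between workflow steps
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
# Emission likelihood of each observed action given each hidden workflow state.
E = {"reach":  np.array([0.6, 0.3, 0.1]),
     "unbolt": np.array([0.1, 0.8, 0.1]),
     "insert": np.array([0.1, 0.2, 0.7])}

belief = np.array([1.0, 0.0, 0.0])           # start in the first workflow step
for action in ["reach", "unbolt", "unbolt", "insert"]:
    belief = E[action] * (belief @ T)        # predict with T, then weight by observation
    belief /= belief.sum()
print(states[int(np.argmax(belief))])        # most likely current workflow step
```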
The detailed architecture of the reasoning module 110 as shown in
This step is represented as an AR mentor “Intent”, and can encode dialog for the speech generator 114 to generate, actions or changes within the UI, both of those, or even neither of those (i.e., take no action). The reasoning module 110 of
The reasoning module 110 can track events being observed in a heads-up display, determine the best modality to communicate a concept to the user of the heads-up display, dynamically compose multimodal (UI and language) “utterances”, manage the amount of dialog versus the amount of display changes in the interaction, and the like. According to one embodiment, AR mentor “Intents” also accommodate robust representation of a variety of events recognized by the recognition module 406 depicted in
The reasoning module 110 can further initiate dialogs based on exogenous events (“exogenous” in the sense that they occur outside the user-mentor dialog), which can include the current assessment of the AI-driven AR mentoring and collaboration system 100 of an ongoing operation/maintenance process it is monitoring by extending a “proactive offer” functionality, and enhance the representation of the input it uses to make next-step decisions.
In some embodiments, the AR generator 112 of the AI-driven AR mentoring and collaboration system 100 of
In the embodiment of the AR generator 112 of
The responses from the NLG 1004 can be customized according to the user as well as the state of the simulated interaction, such as the training, repair operation, maintenance, etc. The speech generator 114 can optionally take advantage of external speech cues, language cues and other cues coming from the scene to customize the responses. In various cases, the NLG 1004 leverages visual systems such as AR and a user interface on a display to provide the most natural response. As an example, the NLG 1004 can output “Here is the specific component” and use the AR generator 112, as depicted in
In the embodiment of
In some embodiments of the present principles, clip-on sensor packages (not shown) can be utilized to reduce weight. In some embodiments, the video sensor can comprise an ultra-compact USB 2.0 camera from XIMEA (MU9PC_HM) with high resolution and sensitivity for AR and a 5.7×4.28 mm footprint. Alternatively, a stereo sensor and a lightweight clip-on bar structure can be used for the camera. The IMU sensor can be an ultra-compact MEMS IMU (accelerometer, gyro) developed by INERTIAL LABS that also incorporates a 3-axis magnetometer. In an alternate embodiment, the XSENS MTI-G SENSOR, which incorporates a GPS, can be used as the IMU sensor.
In the embodiment of
At 1204, 3D positional information of the objects in the at least one captured scene is determined using depth information of the objects in the captured scene. As described above, in some embodiments depth information can be determined using a sensor capable of providing depth information for captured images; alternatively or in addition, in some embodiments of the present principles, depth information of a scene can be determined using image-based methods, such as monocular depth estimation of images. The depth information can be used to determine 3D positional information of the objects in the at least one captured scene. The method 1200 can proceed to 1206.
At 1206, information regarding the identified objects of the at least one captured scene is combined with respective 3D positional information to determine at least one scene graph representation of the at least one scene, which provides information regarding positions of the identified objects in the at least one captured scene and spatial relationships among the identified objects. The method can proceed to 1208.
At 1208, the determined at least one scene graph representation is completed using machine learning to include additional objects or positional information of the objects not identifiable from the at least one captured scene. The method 1200 can proceed to 1210.
At 1210, at least one task to be performed is determined and steps to be performed related to the identified at least one task are determined using a knowledge database comprising data relating to respective steps to be performed for different tasks. As described above, in some embodiments a task to be performed can be determined from language received in a user collaboration request. Alternatively or in addition, in some embodiments, a task to be performed can be determined by generating a scene understanding of the at least one captured scene based on an automated analysis of the at least one captured scene, wherein the at least one captured scene comprises a view of a user during performance of a task related to the identified at least one object in the at least one captured scene. The method 1200 can proceed to 1212.
At 1212, at least one visual representation relating to the determined steps to be performed is generated for the at least one task to assist the at least one user in performing the at least one task related to the collaboration request. The method 1200 can proceed to 1214.
At 1214, a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the at least one user is determined using the completed scene graph. The method 1200 can proceed to 1216.
At 1216, the at least one visual representation is displayed on the see-through display in the determined correct position as an augmented overlay to the view of the at least one user to guide the at least one user to perform the at least one task. The method 1200 can be exited.
In some embodiments, the method can further include extracting one or more visual cues from the at least one captured scene to situate the user in relation to the identified at least one object or the device, wherein the user is situated by tracking a head orientation of the user, and wherein the at least one visual representation is rendered based on a predicted head pose of the user based on the tracked head orientation of the user.
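As a simplified, hypothetical illustration of rendering against a predicted head pose, the sketch below extrapolates the tracked head yaw over an assumed display latency and projects a 3D anchor point into display coordinates; the constant-velocity assumption, intrinsic parameters and all numeric values are illustrative only.

```python
# Illustrative rendering against a predicted head pose: extrapolate the tracked head
# orientation over the display latency, then project a 3D anchor into the display.
import numpy as np

def predict_yaw(yaw_now, yaw_prev, dt, latency):
    """Constant-velocity extrapolation of head yaw (radians) over the display latency."""
    return yaw_now + (yaw_now - yaw_prev) / dt * latency

def project(anchor_xyz, yaw, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Rotate a world-frame anchor into the predicted head frame and project it."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])   # yaw about the y-axis
    x, y, z = R @ np.asarray(anchor_xyz)
    return (fx * x / z + cx, fy * y / z + cy)

yaw = predict_yaw(yaw_now=0.10, yaw_prev=0.08, dt=1 / 60, latency=0.02)
pixel = project(anchor_xyz=(0.2, 0.0, 1.5), yaw=yaw)    # where to draw the overlay
```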
In some embodiments, the method can further include analyzing actions of the user during the performance of a step of the task by using information related to a next step of the task, wherein, if the user has not completed the next step of the task, new visual representations are created to be generated and presented as an augmented overlay to guide the user to complete the performance of the next step of the task, and wherein, if the user has completed the next step of the task and a subsequent step of the task exists, new visual representations are created to be generated and presented as an overlay to guide the user to complete the performance of the subsequent step of the task.
In some embodiments of the method, the at least one captured scene includes both video data and audio data, the video data comprising a view of the user of a real-world scene during performance of a task and the audio data comprising speech of the user during performance of the task, and wherein the steps relating to the performance of the task are further determined using at least one of the video data or the audio data.
In some embodiments, an AI-driven AR mentoring and collaboration system of the present principles can include two or more users working in conjunction, in which received and determined information is shared among the two or more users, wherein a correct position for displaying the at least one visual representation on a see-through display as an augmented overlay to the view of the two or more users is determined using information in at least one completed scene graph.
As depicted in
In different embodiments, the computing device 1300 can be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop, notebook, tablet or netbook computer, mainframe computer system, handheld computer, workstation, network computer, a camera, a set top box, a mobile device, a consumer device, video game console, handheld video game device, application server, storage device, a peripheral device such as a switch, modem, router, or in general any type of computing or electronic device.
In various embodiments, the computing device 1300 can be a uniprocessor system including one processor 1310, or a multiprocessor system including several processors 1310 (e.g., two, four, eight, or another suitable number). Processors 1310 can be any suitable processor capable of executing instructions. For example, in various embodiments processors 1310 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs). In multiprocessor systems, each of processors 1310 may commonly, but not necessarily, implement the same ISA.
System memory 1320 can be configured to store program instructions 1322 and/or, in some embodiments, machine learning systems that are accessible by the processor 1310. In various embodiments, system memory 1320 can be implemented using any suitable memory technology, such as static random-access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing any of the elements of the embodiments described above can be stored within system memory 1320. In other embodiments, program instructions and/or data can be received, sent or stored upon different types of computer-accessible media or on similar media separate from the system memory 1320 or the computing device 1300.
In one embodiment, I/O interface 1330 can be configured to coordinate I/O traffic between processor 1310, system memory 1320, and any peripheral devices in the device, including network interface 1340 or other peripheral interfaces, such as input/output devices 1350. In some embodiments, I/O interface 1330 can perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 1320) into a format suitable for use by another component (e.g., processor 1310). In some embodiments, I/O interface 1330 can include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1330 can be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 1330, such as an interface to system memory 1320, can be incorporated directly into processor 1310.
Network interface 1340 can be configured to allow data to be exchanged between the computing device 1300 and other devices attached to a network (e.g., network 1390), such as one or more external systems or between nodes of the computing device 1300. In various embodiments, network 1390 can include one or more networks including but not limited to Local Area Networks (LANs) (e.g., an Ethernet or corporate network), Wide Area Networks (WANs) (e.g., the Internet), wireless data networks, some other electronic data network, or some combination thereof. In various embodiments, network interface 1340 can support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via digital fiber communications networks; via storage area networks such as Fiber Channel SANs, or via any other suitable type of network and/or protocol.
Input/output devices 1350 can, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or accessing data by one or more computer systems. Multiple input/output devices 1350 can be present in computer system or can be distributed on various nodes of the computing device 1300. In some embodiments, similar input/output devices can be separate from the computing device 1300 and can interact with one or more nodes of the computing device 1300 through a wired or wireless connection, such as over network interface 1340.
Those skilled in the art will appreciate that the computing device 1300 is merely illustrative and is not intended to limit the scope of embodiments. In particular, the receiver/control unit and peripheral devices can include any combination of hardware or software that can perform the indicated functions of various embodiments, including computers, network devices, Internet appliances, PDAs, wireless phones, pagers, and the like. The computing device 1300 can also be connected to other devices that are not illustrated, or instead can operate as a stand-alone system. In addition, the functionality provided by the illustrated components can in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality can be available.
The computing device 1300 can communicate with other computing devices based on various computer communication protocols such as Wi-Fi, Bluetooth® (and/or other standards for exchanging data over short distances, including protocols using short-wavelength radio transmissions), USB, Ethernet, cellular, an ultrasonic local area communication protocol, etc. The computing device 1300 can further include a web browser.
Although the computing device 1300 is depicted as a general purpose computer, the computing device 1300 is programmed to perform various specialized control functions and is configured to act as a specialized, specific computer in accordance with the present principles, and embodiments can be implemented in hardware, for example, as an application specific integrated circuit (ASIC). As such, the process steps described herein are intended to be broadly interpreted as being equivalently performed by software, hardware, or a combination thereof.
In the network environment 1400 of
In some embodiments in accordance with the present principles, an AI-driven AR mentoring and collaboration system in accordance with the present principles can be located in a single and/or multiple locations/servers/computers to perform all or portions of the herein described functionalities of a system in accordance with the present principles. For example, in some embodiments some components of a AI-driven AR mentoring and collaboration system of the present principles can be located in one or more than one of the user domain 1402, the computer network environment 1406, and the cloud environment 1410 for providing the functions described above either locally or remotely.
Those skilled in the art will also appreciate that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them can be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components can execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures can also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a computer-accessible medium separate from a computing device can be transmitted to the computing device via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link. Various embodiments can further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium or via a communication medium. In general, a computer-accessible medium can include a storage medium or memory medium such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, and the like), ROM, and the like.
The methods and processes described herein may be implemented in software, hardware, or a combination thereof, in different embodiments. In addition, the order of methods can be changed, and various elements can be added, reordered, combined, omitted or otherwise modified. All examples described herein are presented in a non-limiting manner. Various modifications and changes can be made as would be obvious to a person skilled in the art having benefit of this disclosure. Realizations in accordance with embodiments have been described in the context of particular embodiments. These embodiments are meant to be illustrative and not limiting. Many variations, modifications, additions, and improvements are possible. Accordingly, plural instances can be provided for components described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and can fall within the scope of claims that follow. Structures and functionality presented as discrete components in the example configurations can be implemented as a combined structure or component. These and other variations, modifications, additions, and improvements can fall within the scope of embodiments as defined in the claims that follow.
In the foregoing description, numerous specific details, examples, and scenarios are set forth in order to provide a more thorough understanding of the present disclosure. It will be appreciated, however, that embodiments of the disclosure can be practiced without such specific details. Further, such examples and scenarios are provided for illustration, and are not intended to limit the disclosure in any way. Those of ordinary skill in the art, with the included descriptions, should be able to implement appropriate functionality without undue experimentation.
References in the specification to “an embodiment,” etc., indicate that the embodiment described can include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated.
Embodiments in accordance with the disclosure can be implemented in hardware, firmware, software, or any combination thereof. Embodiments can also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium can include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices). For example, a machine-readable medium can include any suitable form of volatile or non-volatile memory.
In addition, the various operations, processes, and methods disclosed herein can be embodied in a machine-readable medium and/or a machine accessible medium/storage device compatible with a data processing system (e.g., a computer system), and can be performed in any order (e.g., including using means for achieving the various operations). Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. In some embodiments, the machine-readable medium can be a non-transitory form of machine-readable medium/storage device.
Modules, data structures, and the like defined herein are defined as such for ease of discussion and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures can be combined or divided into sub-modules, sub-processes or other units of computer code or data as can be required by a particular design or implementation.
In the drawings, specific arrangements or orderings of schematic elements can be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments. In general, schematic elements used to represent instruction blocks or modules can be implemented using any suitable form of machine-readable instruction, and each such instruction can be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks. Similarly, schematic elements used to represent data or information can be implemented using any suitable electronic arrangement or data structure. Further, some connections, relationships or associations between elements can be simplified or not shown in the drawings so as not to obscure the disclosure.
While the foregoing is directed to embodiments of the present principles, other and further embodiments of the invention can be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application claims priority to and benefit of U.S. Provisional Patent Application Ser. No. 63/408,036 filed Sep. 19, 2022, which is herein incorporated by reference in its entirety.