Instruction manuals, instructional videos, and other instructional content are typically stored as unstructured data. Unstructured data may include various data types. For example, a digital instruction manual may include various combinations of text, photographs, videos, diagrams, etc.
The inventors have recognized that because unstructured documents may not follow a consistent schema, it may be difficult to extract from them structured data that follows a specified schema. Such extraction is nonetheless often desirable. For example, unstructured data of an instruction manual may be converted into structured data including a series of steps for a user to perform in a mixed reality application. Extracting structured data from unstructured data is often done manually or using expensive traditional data extraction techniques. These traditional methods may introduce errors, take significant time, require significant computation, or limit the amount of structured data that may be produced for use in mixed reality (MR) applications.
Modern computing and display technologies have facilitated the development of systems for so-called “virtual reality” or “augmented reality” experiences, in which digitally reproduced images or portions thereof are presented to a user in a manner that simulates interaction with the physical world. A virtual reality, or “VR”, scenario typically involves the presentation of digital or virtual image information without transparency to other actual real-world visual input; an augmented reality, or “AR”, scenario typically involves presentation of digital or virtual image information as an augmentation to visualization of the actual world around the user. A mixed reality, or “MR”, scenario is a type of AR scenario and typically involves virtual objects (artifacts) that are integrated into, and responsive to, the natural world. For example, in an MR scenario, a virtual artifact may be occluded by real world objects and/or be perceived as interacting with other objects (virtual or real) in the real world. Throughout this disclosure, reference to AR, VR, or MR is not limiting on the invention and the techniques may be applied to any context.
Furthermore, innovations in machine learning technology have facilitated the development of generative artificial intelligence models including large language models (LLMs) such as generative pre-trained transformer (GPT) 3, 3.5 and 4, generative adversarial networks, recurrent neural networks, reinforcement learning models, variational autoencoders, etc. In general, a generative artificial intelligence model is trained to generate content in response to a prompt.
LLMs like GPT 4 operate on natural language and may be capable of generating output responsive to a variety of prompts, including prompts specifying a format for the output to follow. For example, an LLM may take as input a natural language prompt such as “write a haiku about birds.” The LLM may then produce as output a natural language haiku about birds.
The inventors have recognized that it may be useful to generate structured data for use in mixed reality applications from unstructured data using a large language model. For example, it may be helpful to extract a sequence of steps from an instruction manual and use the sequence of steps to construct a mixed reality instructional application based on the steps.
The inventors have further recognized that conventional techniques for extracting structured data from unstructured data for use in mixed reality applications limit the feasibility of creating mixed reality applications. In particular, they have recognized that development of mixed reality applications based on unstructured documents is hindered by the reliance of conventional techniques on manual data extraction.
In response to recognizing these disadvantages, the inventors have conceived and reduced to practice a software and/or hardware facility for generating structured data for mixed reality applications using generative artificial intelligence (“the facility”).
The facility receives unstructured instructional content, such as an instruction manual or other forms of unstructured instructional content. The facility also receives input specifying a first schema to which structured data extracted from the unstructured instructional content is to conform. The facility creates a system task prompt based on the unstructured instructional content and the input. Then, the facility submits the system task prompt to a generative artificial intelligence model. The facility receives proposed structured data from the generative artificial intelligence model and parses the proposed structured data according to the first schema. In some embodiments, in response to successfully parsing the proposed structured data, the facility converts the proposed structured data to conform to a second schema usable in mixed reality applications. Text associated with a discrete step in a mixed reality application may be displayed using the mixed reality experience. Other information associated with the step may also be displayed, such as pictures, videos, etc. For example, in a mixed reality experience instructing a user how to assemble a drone, text instructing the user to perform an assembly step may be displayed, or the text may be provided in speech via text-to-speech. Video, pictures, etc., instructing the user how to perform the assembly step may also be displayed.
By performing in some or all of the ways described above, the facility automatically generates structured data for use in mixed reality applications based on unstructured data, saving human effort and improving the ultimate quality of the mixed reality applications. Also, the facility improves the functioning of computer or other hardware, such as by reducing the dynamic display area, processing, storage, and/or data transmission resources needed to perform a certain task, thereby enabling the task to be performed by less capable, capacious, and/or expensive hardware devices, and/or be performed with lesser latency, and/or preserving more of the conserved resources for use in performing other tasks. For example, by automatically generating the structured data, the facility obviates the use of processing resources to display and service a user interface for creating and editing manually generated structured data. By using a generative artificial intelligence model to extract structured data from unstructured data, the facility avoids traditional computationally expensive techniques for extracting structured data from unstructured data such as various image processing techniques.
Further, for at least some of the domains and scenarios discussed herein, the processes described herein as being performed automatically by a computing system cannot practically be performed in the human mind, for reasons that include that the starting data, intermediate state(s), and ending data are too voluminous and/or poorly organized for human access and processing, and/or are a form not perceivable and/or expressible by the human mind; the involved data manipulation operations and/or subprocesses are too complex, and/or too different from typical human mental operations; required response times are too short to be satisfied by human performance; etc.
Client 202 provides unstructured data 208 including instructional content to be used in generating structured data for use in mixed reality applications. In various embodiments, unstructured data 208 contains images, videos, or other content. Unstructured data 208 is transformed to unstructured text 210, which is a textual representation of unstructured data 208. The transformation is performed using parsing, image analysis, or other data extraction techniques. In some embodiments, unstructured text 210 is a natural language representation of unstructured data 208.
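As an illustrative, non-limiting sketch, the transformation of unstructured data 208 into unstructured text 210 may be modeled as flattening already-extracted fragments (e.g., OCR output for images, voice-to-text transcripts for video) into a single textual representation; the function name and the tagging convention below are assumptions for illustration only, not the disclosed implementation.

```python
def to_unstructured_text(parts):
    """Combine extracted content fragments into one textual representation.

    Each part is a (kind, payload) pair, where payload is text already
    produced by an upstream extractor (e.g., OCR for images,
    voice-to-text for a video's audio track).
    """
    lines = []
    for kind, payload in parts:
        if kind == "text":
            lines.append(payload)
        elif kind == "image":
            lines.append(f"[image: {payload}]")   # e.g., caption or OCR output
        elif kind == "video":
            lines.append(f"[video transcript: {payload}]")
    return "\n".join(lines)

manual = [
    ("text", "Step 1: Attach the propeller to motor A."),
    ("image", "diagram showing propeller orientation"),
]
print(to_unstructured_text(manual))
```

In practice, the per-fragment extractors would be any of the parsing, image analysis, or other data extraction techniques mentioned above.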
API 204 obtains unstructured text 210 and establishes a system task prompt 212 to provide to generative artificial intelligence model (model) 206. In some embodiments, API 204 stores unstructured data 208, unstructured text 210, or both, as stored data 211. System task prompt 212 includes one or more commands for the generative artificial intelligence model to follow in converting unstructured text 210 into structured data. The one or more commands include defining a schema for the structured data to follow. Table 1 below is an example of commands provided to the model in system task prompt 212 in addition to unstructured text 210.
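An illustrative sketch of assembling system task prompt 212 follows. The exact command wording is hypothetical; the schema fields (procedure_name, summary, and instructions with number, instruction, and image_file) mirror those described elsewhere in this disclosure for the Table 1 schema.

```python
def build_system_task_prompt(unstructured_text, extra_commands=()):
    """Concatenate schema-defining commands with the unstructured text."""
    # Hypothetical schema-defining command; field names follow the
    # Table 1 schema described in this disclosure.
    schema_command = (
        "Convert the following instructional text into JSON of the form: "
        '{"procedures": [{"procedure_name": str, "summary": str, '
        '"instructions": [{"number": int, "instruction": str, '
        '"image_file": str}]}]}'
    )
    commands = [schema_command, *extra_commands]
    return "\n".join(commands) + "\n\n" + unstructured_text

prompt = build_system_task_prompt(
    "Step 1: Attach the propeller to motor A.",
    extra_commands=[
        "Carefully follow the instructions to create the structured data."
    ],
)
```

Simple concatenation is one option contemplated by the disclosure; richer templating could be substituted without changing the overall flow.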
Generative artificial intelligence model (model) 206 obtains system task prompt 212 and generates proposed structured data 214. In the example illustrated in Table 1 and Table 2, the proposed structured data is formatted as a JavaScript Object Notation (JSON) file. In various embodiments, the system task prompt specifies any suitable format for the proposed structured data.
API 204 receives proposed structured data 214. Proposed structured data 214 is structured according to the schema defined by the one or more commands in system task prompt 212. Table 2 below is an example excerpt of proposed structured data generated using the example commands in Table 1.
In some embodiments, API 204 validates proposed structured data 214 by parsing it into the schema defined by the one or more commands in system task prompt 212 to ensure it follows the defined schema. The structured data is in some embodiments usable to present a mixed reality experience to a viewer.
The viewer may also use client 202 to query model 206 about the mixed reality experience. Viewer query 216 is provided to API 204, which creates a viewer prompt 218 based on viewer query 216. Viewer prompt 218 in some embodiments includes information from stored data 211. In some embodiments, viewer prompt 218 includes additional commands. The additional commands included in viewer prompt 218 are in various embodiments determined based on a portion of the mixed reality experience the viewer is viewing, an action of the viewer, a feature of the physical environment, etc.
The viewer may query the facility for information regarding the step in the mixed reality experience they are performing. Referring to
Viewer prompt 218 is provided to model 206, which processes the viewer prompt to produce a response, which the facility processes into a viewer query response. In various embodiments, the viewer query response is provided to the viewer as speech using text-to-speech, as a virtual artifact in the mixed reality experience, etc.
After a start block, process 300 begins at block 302, where the facility receives unstructured data. The unstructured data is in various embodiments a portable document format (PDF), text document, image, video, or any other media or multimedia format.
While
Returning to
In block 306, the facility creates a system task prompt based on the unstructured data and the schema input. The unstructured data is converted to a text format such as plain text before being used to create the system task prompt. As discussed herein, this conversion is performed using any known data extraction technique. In various embodiments where the unstructured data is in a multimedia format, conversion includes applying voice-to-text to audio content, summarization of content by a generative artificial intelligence model, etc. In some embodiments, the system task prompt is created by concatenating the unstructured data with the schema input. The system task prompt in some embodiments includes additional commands for the generative artificial intelligence model to follow, such as “carefully follow the instructions to create the structured data.” In some embodiments, the facility stores one or more commands to be followed by the model and includes them in the prompt when creating it. For example, a stored command specifies certain steps, language, etc., that must be included in the structured data.
In some embodiments, a predetermined command specifies a “temperature” parameter for the model indicating the desired randomness of the model's output. In various embodiments, it is desirable to change the temperature of the model used in generating the response. The temperature is in various embodiments based on factors such as the presence of keywords in the unstructured data, its tone, its subject matter, etc. The facility in some embodiments requests the model to lower its temperature to account for a higher perceived risk present in one or more steps in the procedure. For example, when certain keywords such as “hazard,” “care,” “danger,” etc. are present in a step, the perceived risk of the step is higher. In various embodiments, the model may be prompted to set its own temperature parameter based on a tone, subject matter, content, etc. of the unstructured data.
After block 306, process 300 continues to block 308 where the facility submits the system task prompt to a generative artificial intelligence model. In various embodiments, the generative artificial intelligence model is cohosted with the facility or hosted remotely and accessed via an application programming interface.
After block 308, process 300 continues to block 310 where the facility receives and parses proposed structured data received from the generative artificial intelligence model. According to some embodiments, the proposed structured data is validated by parsing it into the format defined by the schema input. For example, the schema defined in Table 1 includes one or more “procedure” objects, with each procedure object including the fields “procedure_name,” “summary,” and one or more “instructions,” each including a “number,” “instruction,” and “image_file” field. Therefore, the proposed structured data may be parsed for data corresponding to each of these fields. In some embodiments, parsing succeeds when valid data corresponding to each of the fields in the schema is obtained.
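A minimal validation sketch for this parsing step, assuming JSON-formatted proposed structured data and using the field names described above; the error-handling details are illustrative rather than the disclosed implementation.

```python
import json

REQUIRED_INSTRUCTION_FIELDS = {"number", "instruction", "image_file"}

def parse_proposed(raw):
    """Parse proposed structured data; raise ValueError if validation fails."""
    data = json.loads(raw)
    for proc in data["procedures"]:
        if "procedure_name" not in proc or "summary" not in proc:
            raise ValueError("missing procedure field")
        for inst in proc["instructions"]:
            missing = REQUIRED_INSTRUCTION_FIELDS - inst.keys()
            if missing:
                raise ValueError(f"missing instruction fields: {missing}")
    return data["procedures"]

raw = json.dumps({"procedures": [{
    "procedure_name": "Assemble drone",
    "summary": "Attach propellers and arms.",
    "instructions": [{"number": 1,
                      "instruction": "Attach the propeller to motor A.",
                      "image_file": "step1.png"}],
}]})
procedures = parse_proposed(raw)
```

Parsing succeeds when valid data corresponding to each schema field is obtained; a raised error corresponds to the parse-failure path discussed below.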
In some embodiments, parsing fails because data corresponding to a field was missing, malformed, included an incompatible or incorrect value, etc. For example, a file path corresponding to the “image_file” field may be invalid. In some embodiments where parsing fails, the facility solicits a new selection of an unstructured data file to generate a new system task prompt. In various embodiments, block 310 employs embodiments of
After block 310, process 300 continues to block 312 where the facility transforms the proposed structured data into a procedure usable in a mixed reality application. In some embodiments, the facility transforms the proposed structured data into the procedure using a predetermined mapping between the proposed structured data and the procedure. In some embodiments, process 300 may skip (not shown) the parsing step in block 310 and skip block 312, accepting the response from the generative artificial intelligence model as the procedure usable in the mixed reality application, and ending process 300. In at least these embodiments, the schema input defines a schema for the procedure. After block 312, process 300 ends.
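One possible predetermined mapping is sketched below; the source field names follow the schema described above, while the second-schema field names (title, steps, index, text, media) are hypothetical.

```python
def to_mr_procedure(proc):
    """Map a parsed procedure record to a second-schema record for the
    mixed reality runtime (target field names are hypothetical)."""
    return {
        "title": proc["procedure_name"],
        "steps": [
            {"index": inst["number"],
             "text": inst["instruction"],
             "media": [inst["image_file"]]}
            for inst in proc["instructions"]
        ],
    }

proc = {"procedure_name": "Assemble drone",
        "summary": "Attach propellers and arms.",
        "instructions": [{"number": 1,
                          "instruction": "Attach the propeller to motor A.",
                          "image_file": "step1.png"}]}
mr_procedure = to_mr_procedure(proc)
```

Because the mapping is predetermined, it can be implemented deterministically, without a further call to the generative artificial intelligence model.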
Those skilled in the art will appreciate that the acts shown in
Process 600 begins, after a start block, at block 602 where the facility receives unstructured data, schema input, and preliminary input. In some embodiments, no preliminary input is received.
After block 602, process 600 continues to block 604 where the facility creates a system task prompt based on the unstructured data, schema input, and preliminary input. In some embodiments, the facility creates the system task prompt by concatenating a text version of the unstructured data, the schema input, and the preliminary input. In various embodiments, block 604 employs embodiments similar to those of block 306 in
Returning to
After block 606, process 600 continues to block 608, where the facility receives and parses proposed structured data received from the model. In various embodiments, block 608 employs embodiments of block 310 in
After block 608, process 600 continues to decision block 610, where the facility determines whether supplemental input was received.
If no supplemental input was received, process 600 continues to block 618. If supplemental input was received, process 600 continues to block 612, where the facility creates a supplemental prompt based on the supplemental input. In some embodiments, the supplemental prompt is created by concatenating the supplemental input and the proposed structured data. In various embodiments, block 612 employs embodiments of block 306 in
After block 612, process 600 continues to block 614, where the facility submits the supplemental prompt to the model. The model may be cohosted with the facility or remotely hosted on a server, for example, in cloud 222 in
After block 614, process 600 continues to block 616, where the facility receives revised structured data from the generative artificial intelligence model. In some embodiments, the revised structured data is again parsed (not shown) employing embodiments of block 608 to verify its compliance with the schema input.
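The supplemental-input path of blocks 610 through 616 can be sketched as follows; `submit` stands in for the call to the generative artificial intelligence model and is hypothetical, as is the simple concatenation order.

```python
def refine(proposed, supplemental_input, submit):
    """Return revised structured data when supplemental input is present,
    otherwise pass the proposed structured data through unchanged."""
    if not supplemental_input:
        return proposed                               # block 610 "no" branch
    supplemental_prompt = supplemental_input + "\n\n" + proposed  # block 612
    return submit(supplemental_prompt)                # blocks 614 and 616

# Stand-in for model 206, used only to exercise the control flow.
fake_model = lambda prompt: prompt.upper()
unchanged = refine("{}", "", fake_model)   # no supplemental input
revised = refine("abc", "fix", fake_model)
```

In embodiments that re-parse the revised structured data, the returned value would be passed back through the validation of block 608 before use.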
Returning to
After a start block, process 1000 begins at block 1002, where the facility presents a mixed reality experience based on a procedure. In some embodiments, the facility creates the procedure using embodiments of process 600 of
In various embodiments, the facility solicits performance of a step in a procedure by displaying content associated with the step. The content in some embodiments includes one or more graphics associated with the step. Referring to
In some embodiments, the facility moves a virtual artifact with the viewer's field of view in the experience, such that the viewer sees the artifact regardless of where they are looking in the experience. In some embodiments, the facility displays a virtual artifact at a selected physical location in the experience, such as on a table, so the viewer only sees the virtual artifact when their field of view includes the selected physical location.
In various embodiments, the facility solicits performance of a step in the procedure by displaying a virtual action artifact corresponding to the step. Referring again to
After block 1002, process 1000 continues to block 1004, where the facility receives a viewer query. In some embodiments, the facility automatically constructs viewer queries using information collected from the physical environment. The facility in some embodiments receives a video feed from a virtual reality headset worn by the viewer and automatically constructs a viewer query based on the video feed. The facility in some embodiments receives a data stream containing information used to display the mixed reality experience to the viewer. The viewer query in some embodiments includes a command specifying that the response from the model should be formatted to be used as a virtual artifact in the mixed reality experience.
In some embodiments, the viewer query is based on an explicit query provided by the viewer. The viewer in some embodiments provides a verbal query requesting information regarding the mixed reality experience. For example, the viewer asks “how many screws do I need for this step?” The viewer query is then created using speech-to-text on the verbal query. In some embodiments, the viewer may select a viewer query from a plurality of viewer queries displayed in the mixed reality experience. The facility in some embodiments predetermines the plurality of viewer queries for one or more steps in the experience based on other viewers' performance of the one or more steps. In some embodiments, the facility dynamically generates the plurality of viewer queries based on the viewer's performance of the one or more steps.
In some embodiments, the viewer query is based on action or inaction of the viewer. For example, if the viewer is having difficulty performing a step in the experience, the facility in this example automatically sends a viewer query commanding the model to generate a response rephrasing instructions associated with the step. In some embodiments, the facility determines that the viewer is having difficulty completing a step if a threshold time limit such as 30 seconds, one minute, five minutes, etc., is exceeded during the viewer's performance of the step.
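An illustrative timer for such difficulty detection follows. The threshold values come from the description above; the class itself is an assumption, written with an injectable clock so the behavior can be exercised without real waiting.

```python
import time

class StepTimer:
    """Track time spent on a step and flag when a threshold is exceeded."""

    def __init__(self, threshold_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold_seconds
        self.clock = clock
        self.started = None

    def start(self):
        self.started = self.clock()

    def viewer_struggling(self):
        """True once the viewer's time on the step exceeds the threshold."""
        return self.started is not None and \
            self.clock() - self.started > self.threshold

# Injected clock makes the behavior observable without waiting.
now = 0.0
timer = StepTimer(threshold_seconds=30.0, clock=lambda: now)
timer.start()
now = 31.0
print(timer.viewer_struggling())  # True
```

When the flag trips, the facility could automatically issue the rephrasing viewer query described above.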
In some embodiments, the viewer query is contextual. The viewer query in some embodiments corresponds to a feature of the physical environment. For example, the facility detects by a video feed from a camera of a virtual reality headset worn by the viewer that the viewer is holding an assembly upside-down based on a comparison with an image associated with the step such as graphic 308b in
After block 1004, process 1000 continues to block 1006, where the facility creates a viewer prompt based on the viewer query. In various embodiments, block 1006 may employ embodiments of block 506 to create the viewer prompt.
After block 1006, process 1000 continues to block 1008, where the facility submits the viewer prompt to a generative artificial intelligence model.
After block 1008, process 1000 continues to block 1010, where the facility receives a response from the generative artificial intelligence model. In some embodiments, the facility uses the response to display a virtual artifact in the mixed reality experience.
After block 1010, process 1000 continues to block 1012, where the facility presents a virtual artifact in the mixed reality experience based on the response. As discussed herein, the virtual artifact is in some embodiments a graphic associated with the step in the mixed reality experience. The virtual artifact is, in some embodiments, a graphic containing text of the response. In various embodiments, the facility displays the virtual artifact proximate a physical object in the mixed reality experience, or so that the viewer sees the virtual artifact regardless of where the viewer is looking in the experience. After block 1012, process 1000 ends at an end block.
The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet are incorporated herein by reference, in their entirety. Aspects of the embodiments can be modified, if necessary to employ concepts of the various patents, applications and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
This application claims the benefit of provisional U.S. Application No. 63/603,040, filed on Nov. 27, 2023, and entitled “GENERATING STRUCTURED DATA FOR MIXED REALITY APPLICATIONS USING GENERATIVE ARTIFICIAL INTELLIGENCE” which is hereby incorporated by reference in its entirety. In cases where the present application conflicts with a document incorporated by reference, the present application controls.
| Number | Date | Country |
|---|---|---|
| 63603040 | Nov 2023 | US |