Tutorials and guides are critical for onboarding a user to a new product or service. For example, a site may provide user manuals for a software application. A site may require authors to follow a specified structure for each published document. By providing documents that follow a specific format, a site can improve the readability and effectiveness of its documents, protect its brand, and improve user interactions with its products and services.
Authors and managers for sites that provide formatted documents can face a number of challenges. For example, authors have to take the time to read and understand the site's formatting requirements. Formatting requirements can include, but are not limited to, text layout parameters, word limitations, image placement parameters, etc. Thus, in addition to taking the time to formulate accurate content, authors must be mindful of the formatting requirements as they compose documents. These requirements add to the difficulty of composing formatted documents, particularly for a person who is not part of a particular content team. When authors work with different teams, they may have to adjust their writing style to accommodate unfamiliar formatting requirements.
These issues can be further exacerbated by the fact that an author may have to create similar content for different sites. Such scenarios require authors to go through a number of tedious and time-consuming processes to generate different versions of the same content. Such processes are neither scalable nor economically feasible. Accuracy can also be an issue, since human error will always be a part of manual processes for generating documents.
Also, when documents are manually drafted using existing systems, there may be an issue with consistency between different documents on a particular site. When a number of different authors generate documents for a particular site, each author may have a different level of accuracy or a different interpretation of the formatting requirements. This can lead to a library of manuals that have different levels of detail, different formats, etc. When a site has documents that do not consistently follow a given formatting requirement, the readability and usability of the documents may fall below acceptable standards. This can lead to user errors, misuse of a product or service, and other inefficiencies with respect to a site and the use of its computing resources.
The disclosed techniques provide enhanced generation of formatted and organized guides from an unstructured spoken narrative using a large language model. A system uses an unstructured verbal narrative as input in place of a written input. The system uses a large language model to organize the unstructured narrative into a structured guide that follows specific formatting and category requirements. For example, the formatting requirements may define specific headers, steps, bullets, image locations, etc. The formatting requirements may be interpreted from an existing document that is used as fine-tuning data, or they may be determined by an analysis of the verbal narrative. The system can identify categories, e.g., topics, to generate the formatted and organized guides. The categories can be determined by an analysis of the verbal narrative. The system can also suggest new categories, relevant explanations, image locations, and other references the author may not have considered. The process is automated, resulting in complete formatted and organized guides ready for review. Completed formatted and organized guides can also be edited by a user.
Features and technical benefits other than those explicitly described above will be apparent from a reading of the following Detailed Description and a review of the associated drawings. This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The term “techniques,” for instance, may refer to system(s), method(s), computer-readable instructions, module(s), algorithms, hardware logic, and/or operation(s) as permitted by the context described above and throughout the document.
The Detailed Description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items. References made to individual items of a plurality of items can use a reference number with a letter of a sequence of letters to refer to each individual item. Generic references to the items may use the specific reference number without the sequence of letters.
The system can receive the unstructured spoken narrative in the form of a speech audio input 111 from the end user 102. The speech audio input 111 can subsequently be converted into speech text 112 by the use of a speech engine 121. This conversion can occur at regular intervals, or the system 100 can continuously gather the speech audio input 111 as the end user 102 speaks and generate the speech text 112 until the end user 102 pauses and/or stops speaking.
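The disclosure does not tie the speech engine 121 to any particular recognizer, so the following is only a minimal sketch, assuming a microphone-based workflow and using the off-the-shelf speech_recognition package as a stand-in; the function name capture_speech_text is hypothetical.

```python
# Minimal sketch of speech engine 121: listen until the end user 102 pauses,
# then convert the captured speech audio input 111 into speech text 112.
# The speech_recognition package is used here only as an example recognizer.
import speech_recognition as sr

def capture_speech_text(pause_threshold: float = 1.5) -> str:
    """Listen on the default microphone and return a text transcription."""
    recognizer = sr.Recognizer()
    recognizer.pause_threshold = pause_threshold  # seconds of silence that end an utterance
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)  # blocks until the speaker pauses or stops
    # Any speech-to-text backend could be substituted for this call.
    return recognizer.recognize_google(audio)
```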
The speech text 112 can then be processed by a prompt engine 122 to generate model input data 113. The model input data can include formatting requirements 131, one or more categories 132, content data 133, and related content 134. As described above, the prompt engine 122 can generate the formatting requirements 131 from an analysis of the fine-tuning data 110. The prompt engine 122 can generate the categories 132 and content data 133 from an analysis of the unstructured spoken narrative. The categories and the select content from the unstructured spoken narrative can be based on predetermined keywords or predetermined phrases identified in the unstructured spoken narrative. In addition, the system can also analyze the unstructured spoken narrative to determine a sentiment. Thus, the prompt engine can parse what the end user 102 is literally saying as well as the sentiment and emotion implied in the unstructured spoken narrative. For instance, the sentiment can be based in part on the word choice in the unstructured spoken narrative. In addition, subtleties such as inflections, tone of voice, and speaking volume can also be used to determine a sentiment. The determined sentiment can be used to generate the formatting requirements 131, the one or more categories 132, or the content data 133.
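As one way to picture the model input data 113 described above, the following is an illustrative sketch of a container for the formatting requirements 131, categories 132, content data 133, related content 134, and a determined sentiment; the class and field names are hypothetical and not taken from the disclosure.

```python
# Illustrative container for the model input data 113 assembled by the prompt
# engine 122. The class and field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ModelInputData:
    formatting_requirements: list[str] = field(default_factory=list)  # formatting requirements 131
    categories: list[str] = field(default_factory=list)               # categories 132
    content_data: dict[str, str] = field(default_factory=dict)        # content data 133, keyed by category
    related_content: list[str] = field(default_factory=list)          # related content 134
    sentiment: str | None = None  # e.g., inferred from word choice, tone, inflection, or volume
```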
The end user 102 can speak naturally about a product or service. In one illustrative example, the end user can describe the general steps of how to use a software application. The system 100 can analyze the speech audio 111 from the end user and generate data defining categories 132. The categories can define topics that will be used for the final output 140. The determined categories can include, for example, categories such as: Procedures, Tools, Materials, Going Further, etc. The system 100 also identifies select content from the natural language input that corresponds to each category. For example, the natural language input may include a verbal narrative describing steps on how to operate a software application. The system may then associate such content with a category, e.g., a Procedures category.
In some embodiments, the system can also pull in references that were not part of the speech audio input 111 but are related to items mentioned in the speech audio input 111. In such embodiments, the system may generate one or more queries 114 and send those queries to external resources 123, e.g., databases 123A, web sites 123B, and other resources 123N. Those resources 123 can return related content 134 containing the references. The system can then generate model input data 113 comprising the formatting requirements 131, categories 132, content data 133, and the related content 134. The model input data 113 can then cause the large language model 124 to generate the output 140 in a format that provides a consistent structure. In some embodiments, the output 140 is in a markdown format.
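Continuing the sketch above, the following shows one possible way the model input data 113 could be flattened into a prompt that asks the large language model 124 for markdown output 140; the disclosure does not specify a model interface, so call_llm is a hypothetical placeholder rather than a real API.

```python
# Sketch of flattening model input data 113 into a prompt for the large
# language model 124. call_llm is a hypothetical placeholder, not a real API.
def build_prompt(data: ModelInputData) -> str:
    lines = ["Produce a guide in markdown that follows these rules:"]
    lines += [f"- Formatting requirement: {req}" for req in data.formatting_requirements]
    for category in data.categories:
        lines.append(f"\nSection '{category}' should cover: {data.content_data.get(category, '')}")
    if data.related_content:
        lines.append("\nRelated references to incorporate:")
        lines += [f"- {ref}" for ref in data.related_content]
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Placeholder for invoking the large language model 124; substitute a real model client."""
    raise NotImplementedError

def generate_guide(data: ModelInputData) -> str:
    return call_llm(build_prompt(data))  # returns the output 140 as markdown text
```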
The prompt engine 122 can identify formatting requirements from the fine-tuning data 110, e.g., an example document, and identify the format of the title, headers, and content. The prompt engine 122 can also identify image locations and other characteristics of the image, e.g., a transparency level, color characteristics, border characteristics, etc. The prompt engine 122 can determine one or more formatting requirements 131 based on a layout and/or format of content of the fine-tuning data 110. Thus, the prompt engine 122 can analyze the document and identify a font type, font size, text formats, etc. The layout of the document can be dependent on spacing, paragraph positions, margins, etc. Such characteristics of the document can be interpreted by the system and recorded as formatting requirements 131.
The prompt engine 122 can also analyze the fine-tuning data 110 to identify categories 132. The prompt engine 122 can select certain sections of text as categories. The sections of text that are selected as a category can have a predetermined format, such as bolded text, left-aligned text, etc. The sections of text that are selected as a category can also include predetermined keywords, such as words like “prerequisites,” “operations,” etc. In this example, the system determined that the fine-tuning data included several categories, e.g., Prerequisites, Vehicle Entry, Vehicle Operation, Operating Hints, Required Content, etc. Individual categories may also have an associated level, e.g., the “Prerequisites” category may be at a first level, while the “Operating Hints” category may be at a second level. Each level may also have a different format.
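As a rough illustration of this analysis, the following sketch assumes the fine-tuning data 110 arrives as a .docx file and treats short bolded paragraphs, or paragraphs containing known keywords, as candidate categories 132; the python-docx parser and the keyword list are assumptions, not part of the disclosure.

```python
# Pulling candidate categories 132 out of fine-tuning data 110 when it is a
# .docx document: short bolded paragraphs, or paragraphs containing known
# keywords, are treated as category headers. python-docx and the keyword
# list are illustrative choices only.
from docx import Document

CATEGORY_KEYWORDS = {"prerequisites", "operations", "procedures", "materials"}

def extract_categories(path: str) -> list[str]:
    categories = []
    for paragraph in Document(path).paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue
        bolded = any(run.bold for run in paragraph.runs)
        has_keyword = any(word in text.lower() for word in CATEGORY_KEYWORDS)
        if (bolded and len(text.split()) <= 5) or has_keyword:
            categories.append(text)
    return categories
```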
The prompt engine 122 can also analyze the fine-tuning data 110 to determine one or more formatting requirements. For instance, the fine-tuning data 110 may be in the form of a document having an image in the upper right corner. In response to receiving this type of fine-tuning data 110, the system may generate formatting requirements indicating that an image in the upper right corner of a document is required. Such requirements can also include properties, such as images having a particular resolution, size, etc.
After the fine-tuning data is used to generate the formatting requirements and categories, the system can receive an unformatted speech input from the end user to generate another document having new content but also having a format similar to the fine-tuning data 110. In this example, the fine-tuning data 110 is a document that pertains to instructions on driving a gas car, and the end user wishes to generate a document having a format that is similar to the fine-tuning data 110 but pertaining to instructions on driving an electric car.
The prompt engine 122 can identify the select content from the unstructured verbal narrative. For example, the system may identify specific steps on how to operate a car, prerequisites for driving a car, etc. This can be achieved by identifying individual sections of select content 133 that match specific combinations of keywords or by the use of instructions provided to a large language model. For example, the system may provide the speech text 112 to a large language model along with instructions to identify operating steps or prerequisites.
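The keyword-matching path could look like the following minimal sketch, which splits the speech text 112 into sentences and tags each one with a category label when it contains any of a set of illustrative keywords; the keyword sets and function name are hypothetical.

```python
# Keyword-based identification of select content 133: split the speech
# text 112 into sentences and tag each one with a label when it contains
# any keyword from an illustrative keyword set.
import re

SECTION_KEYWORDS = {
    "Procedures": {"step", "first", "next", "then", "finally"},
    "Prerequisites": {"before", "need", "require", "install"},
}

def select_content(speech_text: str) -> dict[str, list[str]]:
    sections: dict[str, list[str]] = {label: [] for label in SECTION_KEYWORDS}
    for sentence in re.split(r"(?<=[.!?])\s+", speech_text):
        words = set(sentence.lower().split())
        for label, keywords in SECTION_KEYWORDS.items():
            if words & keywords:
                sections[label].append(sentence)
    return sections
```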
The prompt engine 122 can also identify additional categories from the unstructured verbal narrative. For example, the system may determine categories from the end user's speech, e.g., topics called “Prerequisites,” “Download,” “Operation,” and “Conclusion.”
The association data associates specific sections of content that are extracted from the natural language input with individual categories. The association data can also associate the related references with individual categories. For example, content extracted from the verbal narrative input that describes installation steps is categorized as “Procedures,” content extracted from the verbal narrative input that describes parts of a product is categorized as “Materials,” etc. The related references, e.g., the links to the applications, that are retrieved from external resources may be associated with a “Prerequisites” category, etc.
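One illustrative shape for the association data is sketched below, with each category mapping to the content sections extracted from the narrative and to any related references retrieved from external resources; the structure and example values are hypothetical.

```python
# Illustrative association data: each category maps to the content sections
# extracted from the narrative and to the related references retrieved from
# external resources. The names and example values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CategoryAssociation:
    content_sections: list[str] = field(default_factory=list)    # extracted from the verbal narrative
    related_references: list[str] = field(default_factory=list)  # e.g., links returned for queries 114

association_data: dict[str, CategoryAssociation] = {
    "Procedures": CategoryAssociation(content_sections=["Open the installer and select Install."]),
    "Prerequisites": CategoryAssociation(related_references=["https://example.com/download"]),
}
```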
The prompt engine can also identify a need for additional information that was not mentioned in the speech input, generate queries for the additional information and retrieve the related content in response to those queries. The additional information can be stored as related content 134.
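A minimal sketch of this step, under the assumption that the terms needing supporting references are already known, might look like the following; fetch_external is a hypothetical placeholder for a client that queries the external resources 123.

```python
# Sketch of retrieving related content 134 that was not part of the speech
# input: build queries 114 for terms that appear in the speech text and are
# known to need supporting references, then collect whatever the external
# resources 123 return. fetch_external is a hypothetical placeholder.
def gather_related_content(speech_text: str, needs_reference: set[str]) -> list[str]:
    queries = [term for term in needs_reference if term.lower() in speech_text.lower()]
    related: list[str] = []
    for query in queries:
        related.extend(fetch_external(query))  # e.g., databases 123A or web sites 123B
    return related

def fetch_external(query: str) -> list[str]:
    """Placeholder for querying external resources 123; substitute a real search or database client."""
    raise NotImplementedError
```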
The prompt engine 122 can also identify additional categories from the unstructured verbal narrative. For example, the system may determine categories from the end user's speech. For instance, if the end user stated they want a Conclusion section, the system can add additional categories to the category list at this stage. In this example, the prompt engine 122 also retrieves other related content 134 from other resources, such as an image of the software application.
Any of the described features can be combined to generate formatted documents from a natural language speech input. For instance, the system can receive a formatted document as pre-training data and also generate formatting requirements and categories from a speech input. Such combinations of inputs, e.g., pre-training documents, natural language speech, and cues conveyed in the speech, such as inflections in a person's voice or tone changes, can be used by a system to identify formatting requirements, content, categories, additional categories, or queries for related content.
The particular implementation of the technologies disclosed herein is a matter of choice dependent on the performance and other requirements of a computing device. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These states, operations, structural devices, acts, and modules can be implemented in hardware, software, firmware, in special-purpose digital logic, and any combination thereof. It should be appreciated that more or fewer operations can be performed than shown in the figures and described herein. These operations can also be performed in a different order than those described herein.
It also should be understood that the illustrated methods can end at any time and need not be performed in their entireties. Some or all operations of the methods, and/or substantially equivalent operations, can be performed by execution of computer-readable instructions included on a computer-storage media, as defined below. The term “computer-readable instructions,” and variants thereof, as used in the description and claims, is used expansively herein to include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based, programmable consumer electronics, combinations thereof, and the like.
Thus, it should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in firmware, in special purpose digital logic, and any combination thereof.
For example, the operations of the routine 500 are described herein as being implemented, at least in part, by modules running the features disclosed herein, and can be a dynamically linked library (DLL), a statically linked library, functionality produced by an application programming interface (API), a compiled program, an interpreted program, a script, or any other executable set of instructions. Data can be stored in a data structure in one or more memory components. Data can be retrieved from the data structure by addressing links or references to the data structure.
Although the following illustration refers to the components of the figures, it should be appreciated that the operations of the routine 500 may be also implemented in many other ways. For example, the routine 500 may be implemented, at least in part, by a processor of another remote computer or a local circuit. In addition, one or more of the operations of the routine 500 may alternatively or additionally be implemented, at least in part, by a chipset working alone or in conjunction with other software modules. In the example described below, one or more modules of a computing system can receive and/or process the data disclosed herein. Any service, circuit or application suitable for providing the techniques disclosed herein can be used in operations described herein.
At operation 502, the system receives fine-tuning data. The fine-tuning data can be in the form of an existing document that has a desired format, layout and organization. For instance, the fine-tuning data can be an existing document with instructions on how to install and operate a software application. This will be used by the system as pre-training data to teach the system a format to use when generating a document based on a natural language input. The fine-tuning data can be provided by a computer associated with a system trainer or a system administrator.
At operation 504, the system receives a speech audio input and converts the speech audio input into a text translation. The speech audio input can be received at a microphone at a computing device of an end user.
At operation 506, the system determines formatting requirements by analyzing the speech audio input, the speech text, or the fine-tuning data. This operation determines the one or more formatting requirements 131 based on the use of keywords detected in the speech input, the volume of the speech input, voice inflections, speech tone, or other speech characteristics. The formatting requirements can also be based on a format or layout of the fine-tuning data, such as an example document. The formatting requirements can also be based on formatting requirements that are derived from previous iterations of this routine. Because the system learns over time, it can adjust the formatting requirements in each iteration of the routine 500 and continually fine-tune the formatting requirements based on any of the inputs described herein.
At operation 508, the system determines categories by analyzing the text translation of the unstructured verbal narrative. Categories can also be determined by the analysis of the fine-tuning data. The categories are topics to be used as headers for sections of the output file. The categories can be identified by keywords, emphasized keywords in a speech input, and/or emphasized words in the fine-tuning data, e.g., words having a large font, in bold text, etc.
At operation 510, the system determines individual sections of select content by analyzing the text translation of the unstructured verbal narrative. This can include the identification of sections of text that describe operations, steps, items, materials, or any other content related to the categories.
At operation 512, the system determines associations between individual categories and the sections of selected content. This can be done by the use of keywords or other relevancy scoring techniques that indicate a relationship between a category and identified content.
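One lightweight way to score such a relationship is a simple word-overlap heuristic, sketched below; this is only an illustrative stand-in for whatever relevancy scoring an implementation actually uses, and the function name is hypothetical.

```python
# A word-overlap relevancy score for operation 512: associate a section of
# selected content with the category whose keywords it overlaps the most.
# The heuristic and the keyword sets passed in are illustrative only.
def best_category(section: str, category_keywords: dict[str, set[str]]) -> str | None:
    words = set(section.lower().split())
    scores = {category: len(words & keywords) for category, keywords in category_keywords.items()}
    if not scores:
        return None
    top = max(scores, key=scores.get)
    return top if scores[top] > 0 else None
```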
At operation 514, the system causes a large language model to generate a formatted document using the model input data. The model input data can include the formatting requirements, categories, the selected content and the association data.
At operation 516, the system causes a modification to the formatted document in response to editing commands. The system can receive keyboard commands, voice commands, or other forms of input that indicate edits to the document. Those commands can modify the formatted document, including but not limited to the selected content, the categories, or a format property.
At operation 518, the system causes a modification to the model input data based on the editing commands. This can include modifications to the formatting requirements and categories in response to commands that modify them. This modified version of the model input data can be used in future iterations of the routine.
Processing unit(s), such as processing unit(s) of processing system 602, can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip Systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
A basic input/output system containing the basic routines that help to transfer information between elements within the computer architecture 600, such as during startup, is stored in the ROM 608. The computer architecture 600 further includes a mass storage device 612 for storing an operating system 614, application(s) 616, modules 618, and other data described herein.
The mass storage device 612 is connected to processing system 602 through a mass storage controller connected to the bus 610. The mass storage device 612 and its associated computer-readable media provide non-volatile storage for the computer architecture 600. Although the description of computer-readable media contained herein refers to a mass storage device, the computer-readable media can be any available computer-readable storage media or communication media that can be accessed by the computer architecture 600.
Computer-readable media includes computer-readable storage media and/or communication media. Computer-readable storage media includes one or more of volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Thus, computer storage media includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including RAM, static RAM (SRAM), dynamic RAM (DRAM), phase change memory (PCM), ROM, erasable programmable ROM (EPROM), electrically EPROM (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device.
In contrast to computer-readable storage media, communication media can embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media. That is, computer-readable storage media does not include communications media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.
According to various configurations, the computer architecture 600 may operate in a networked environment using logical connections to remote computers through the network 620. The computer architecture 600 may connect to the network 620 through a network interface unit 622 connected to the bus 610. The computer architecture 600 also may include an input/output controller 624 for receiving and processing input from a number of other devices, including a keyboard, mouse, touch, or electronic stylus or pen. Similarly, the input/output controller 624 may provide output to a display screen, a printer, or other type of output device.
The software components described herein may, when loaded into the processing system 602 and executed, transform the processing system 602 and the overall computer architecture 600 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The processing system 602 may be constructed from any number of transistors or other discrete circuit elements, which may individually or collectively assume any number of states. More specifically, the processing system 602 may operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions may transform the processing system 602 by specifying how the processing system 602 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the processing system 602.
Accordingly, the distributed computing environment 700 can include a computing environment 702 operating on, in communication with, or as part of the network 704. The network 704 can include various access networks. One or more client devices 706A-706N (hereinafter referred to collectively and/or generically as “computing devices 706”) can communicate with the computing environment 702 via the network 704. In one illustrated configuration, the computing devices 706 include a computing device 706A such as a laptop computer, a desktop computer, or other computing device; a slate or tablet computing device (“tablet computing device”) 706B; a mobile computing device 706C such as a mobile telephone, a smart phone, or other mobile computing device; a server computer 706D; and/or other devices 706N. It should be understood that any number of computing devices 706 can communicate with the computing environment 702.
In various examples, the computing environment 702 includes servers 708, data storage 710, and one or more network interfaces 712. The servers 708 can host various services, virtual machines, portals, and/or other resources. In the illustrated configuration, the servers 708 host virtual machines 714, Web portals 716, mailbox services 718, storage services 720, and/or social networking services 722.
As mentioned above, the computing environment 702 can include the data storage 710. According to various implementations, the functionality of the data storage 710 is provided by one or more databases operating on, or in communication with, the network 704. The functionality of the data storage 710 also can be provided by one or more servers configured to host data for the computing environment 702. The data storage 710 can include, host, or provide one or more real or virtual datastores 726A-726N (hereinafter referred to collectively and/or generically as “datastores 726”). The datastores 726 are configured to host data used or created by the servers 708 and/or other data. That is, the datastores 726 also can host or store web page documents, word documents, presentation documents, data structures, algorithms for execution by a recommendation engine, and/or other data utilized by any application program. Aspects of the datastores 726 may be associated with a service for storing files.
The computing environment 702 can communicate with, or be accessed by, the network interfaces 712. The network interfaces 712 can include various types of network hardware and software for supporting communications between two or more computing devices including the computing devices and the servers. It should be appreciated that the network interfaces 712 also may be utilized to connect to other types of networks and/or computer systems.
It should be understood that the distributed computing environment 700 described herein can provide any aspects of the software elements described herein with any number of virtual computing resources and/or other distributed computing functionality that can be configured to execute any aspects of the software components disclosed herein. According to various implementations of the concepts and technologies disclosed herein, the distributed computing environment 700 provides the software functionality described herein as a service to the computing devices. It should be understood that the computing devices can include real or virtual machines including server computers, web servers, personal computers, mobile computing devices, smart phones, and/or other devices. As such, various configurations of the concepts and technologies disclosed herein enable any device configured to access the distributed computing environment 700 to utilize the functionality described herein for providing the techniques disclosed herein, among other aspects.
Conditional language such as, among others, “can,” “could,” “might” or “may,” unless specifically stated otherwise, is understood within the context to present that certain examples include, while other examples do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example. Conjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is to be understood to present that an item, term, etc. may be either X, Y, or Z, or a combination thereof.
The terms “a,” “an,” “the” and similar referents used in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural unless otherwise indicated herein or clearly contradicted by context. The terms “based on,” “based upon,” and similar referents are to be construed as meaning “based at least in part” which includes being “based in part” and “based in whole” unless otherwise indicated or clearly contradicted by context.
In addition, any reference to “first,” “second,” etc. elements within the Summary and/or Detailed Description is not intended to and should not be construed to necessarily correspond to any reference of “first,” “second,” etc. elements of the claims. Rather, any use of “first” and “second” within the Summary, Detailed Description, and/or claims may be used to distinguish between two different instances of the same element (e.g., two different audio inputs).
In closing, although the various configurations have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended representations is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter.
This patent application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/467,795 filed May 19, 2023, entitled “ENHANCED GENERATION OF FORMATTED AND ORGANIZED GUIDES FROM UNSTRUCTURED SPOKEN NARRATIVE USING LARGE LANGUAGE MODELS,” which is hereby incorporated in its entirety by reference.