The present application generally relates to generating video surgical reports and, in particular, using machine learning to generate content for video surgical reports.
A surgical report is often used to explain to a patient what occurred during a surgical procedure. These surgical reports include a description of important events that occurred during the surgery. Generating these reports can be a time-consuming and inaccurate process. This is because generating these reports relies on the surgeon recalling, after the surgery (i.e., during the report generation process), what events occurred during the surgery. Since the generation of this report is not contemporaneous with the procedure, the surgeon may not identify all of the events from the surgery and/or may not recall all of the circumstances surrounding those events which have been identified. Further, the generation of these reports takes up valuable time of the surgeon.
Therefore, it would be beneficial to have systems, methods, and programming that produce improved surgical reports and processes for creating surgical reports.
Described are systems, methods, and programming for a machine learning pipeline used to generate a video surgical report. The systems, methods, and programming can be configured to automatically generate video content for the video surgical report. The machine learning pipeline may include one or more machine learning models, each supporting a particular aspect of a video surgical report generation process. The machine learning pipeline can offload much of the work typically performed by the surgeon to create the video surgical report, thereby saving significant time. Additionally, the machine learning pipeline may intelligently curate content for the video surgical report. This curated content may be selected to highlight key surgical events from the surgical procedure. The machine learning models may also generate text, audio, video, or other media for the video surgical report.
The machine learning pipeline may include one or more machine learning models relating to various portions of the video surgical report generation process. For example, the machine learning models may include one or more models relating to information distillation, information extraction, language generation, language translation, audio generation, video generation, etc. Some or all of the machine learning models may access images from a surgical video feed, such as images captured by an endoscope. These images may be used to create the video surgical report describing the surgical procedure.
During an information distillation stage of the video surgical report generation process, one or more machine learning models may analyze the images to determine a surgical phase, surgical activities performed during the surgical phase, or other information. The determined surgical phase and/or surgical activities may be provided to an information extraction stage of the video surgical report generation process. During the information extraction stage, the images may be analyzed using one or more machine learning models to determine whether the images depict any key surgical events and/or anatomical structures. The key surgical events refer to rare and/or unexpected events. Additionally, the machine learning models may determine anatomical structure measurement information associated with the key surgical events and/or the anatomical structures. The anatomical structure measurement information may include measurements describing physical characteristics of anatomical structures determined to be depicted by the images (e.g., shape, size, volume, etc.). The machine learning models may also select a set of images from the images analyzed based on whether the images depict a key surgical event, anatomical structure, or another aspect of the surgical procedure.
The set of images, key surgical events, information describing anatomical structures, anatomical structure measurement information, or other information may be provided to a language generation stage of the video surgical report generation process. During the language generation stage, one or more machine learning models may generate a description for the set of images to be included in the video surgical report. The description may be created based on the key surgical events, anatomical structures, anatomical structure measurement information, or other information. If necessary, the description may be translated to one or more additional languages during a language translation stage. Additionally, the description may be generated in different styles depending on the target audience. For example, one vocabulary may be used to form descriptions targeted to a first audience (e.g., for training individuals) and another vocabulary may be used to form descriptions targeted to a second audience (e.g., for auditing medical procedures). Different language translation models may be trained to translate the description from one language to the additional languages. During an audio generation stage, audio may be generated based on the description, the set of images, and/or other information identified during the video surgical report generation process. The audio may be generated such that it has sound characteristics similar to the vocal characteristics of the surgeon (or another medical professional).
Using the set of images, the generated description, the generated audio, and/or other information, a video surgical report may be generated during a video generation stage. The video surgical report may incorporate some or all of the images, text from the generated description, and/or graphics to help describe various aspects of the surgical procedure. The video surgical report may include a virtual character (e.g., an avatar) programmed to speak the generated audio. For example, data programming facial movements and expressions of the virtual character may be included in the video surgical report such that the virtual character appears to utter the generated text. The video surgical report may be created with minimal manual input from the surgeon, thereby reducing the amount of time the surgeon needs to devote to creating a surgical report.
According to some examples, a method includes obtaining one or more images of a surgical procedure; determining, using one or more machine learning models, a set of images from the one or more images based on the surgical procedure; and generating a video surgical report for the surgical procedure comprising at least some of the set of images. The set of images can comprise fewer images than the obtained one or more images of the surgical procedure. Therefore, the set of images can form a compressed representation of the obtained one or more images of the surgical procedure.
In any of the examples, generating the video surgical report can include generating, using the one or more machine learning models, text describing the at least some of the set of images, wherein the video surgical report comprises at least some of the text corresponding to the at least some of the set of images.
In any of the examples, generating the video surgical report can include generating a virtual character programmed to output audio associated with the at least some of the set of images, the video surgical report comprising the virtual character.
In any of the examples, determining the set of images can include determining, using the one or more machine learning models, at least one of: a phase of the surgical procedure, a surgical activity being performed during the phase, or information related to the set of images.
In any of the examples, determining the set of images can include selecting the set of images from the one or more images based on the at least one of the phase, the surgical activity, or the information related to the set of images.
In any of the examples, the method can further include training the one or more machine learning models to analyze content depicted by the set of images to determine the at least one of the phase, the surgical activity, or the information related to the set of images.
In any of the examples, the one or more images can include a first image and a second image captured during a same phase of the surgical procedure, and determining the set of images can include: computing, using the one or more machine learning models, a first classification score and a second classification score respectively associated with the first image and the second image; and adding at least one of the first image or the second image to the set of images based on the first classification score and the second classification score.
In any of the examples, the method can further include identifying at least one image from the one or more images based on preoperative information related to the surgical procedure, wherein the set of images comprises the at least one image.
In any of the examples, obtaining the one or more images can include identifying, using the one or more machine learning models, a subset of frames from a video of the surgical procedure that depict one or more objects associated with the surgical procedure, wherein the one or more images comprise the subset of frames.
In any of the examples, the method can further include detecting, using the one or more machine learning models, one or more objects associated with the surgical procedure within the one or more images, wherein the at least some of the set of images are selected based on the one or more objects.
In any of the examples, the method can further include receiving preoperative information related to at least one of the surgical procedure or a patient associated with the surgical procedure; and generating content to be included in the video surgical report based on the received preoperative information.
In any of the examples, the method can further include generating, using the one or more machine learning models, text associated with the set of images; and associating one or more portions of the text with the set of images.
In any of the examples, the method can further include generating, using the one or more machine learning models, first text associated with the set of images, the first text being in a first language; and transforming, using the one or more machine learning models, the first text into second text, the second text being in a second language, the video surgical report comprising at least some of the second text corresponding to the at least some of the set of images.
In any of the examples, generating the video surgical report can include generating, using the one or more machine learning models, text for the at least some of the set of images; and generating, using the one or more machine learning models, audio based on the text, the video surgical report comprising the audio. Hence, audio information representative of the at least some of the set of images can be generated, based on the at least some of the set of images.
In any of the examples, the method can further include receiving, using one or more audio sensors, audio captured during the surgical procedure, wherein the video surgical report comprises at least some of the audio corresponding to the at least some of the set of images.
In any of the examples, generating the video surgical report can include obtaining pre-generated text associated with content depicted by the at least some of the set of images; obtaining user-provided text of audio captured during the surgical procedure; and generating text for the video surgical report based on the pre-generated text and the user-provided text.
In any of the examples, generating the video surgical report can include generating audio based on data stored in an audio profile of a user, the data comprising at least one of a pitch, a timbre, a loudness, or a modulation associated with the user.
In any of the examples, obtaining the one or more images can include accessing video captured during the surgical procedure; and extracting at least one video snippet from the video based on the surgical procedure, wherein the one or more images comprise the at least one video snippet.
In any of the examples, the method can further include obtaining one or more additional images captured subsequent to the surgical procedure; and generating an updated video surgical report comprising at least some of the one or more additional images.
In any of the examples, generating the video surgical report can include adding one or more additional images to the video surgical report based on a similarity between content depicted by the one or more additional images and content depicted by at least one of the set of images, wherein the one or more additional images are captured prior to the surgical procedure.
According to some examples, a non-transitory computer-readable medium stores computer program instructions that, when executed by one or more processors, effectuate the method of any of the examples.
According to some examples, a computer program product comprises software code portions including computer program instructions that, when executed by one or more processors, effectuate the method of any of the examples.
According to some examples, a system includes: memory storing computer program instructions; and one or more processors configured to execute the computer program instructions to cause the one or more processors to perform the method of any of the examples.
According to some examples, a medical device includes: one or more processors programmed to perform the method of any of the examples.
In any of the examples, the medical device can further include: an image sensor configured to capture the one or more images of the surgical procedure.
It will be appreciated that any of the variations, aspects, features, and options described in view of the methods apply equally to the systems and devices, and vice versa. It will also be clear that any one or more of the above variations, aspects, features, and options can be combined.
The invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Reference will now be made in detail to implementations and various aspects and variations of systems and methods described herein. Although several example variations of the systems and methods are described herein, other variations of the systems and methods may include aspects of the systems and methods described herein combined in any suitable manner having combinations of all or some of the aspects described.
Described are systems, methods, and programming for generating a video surgical report and a machine learning pipeline used for generating the report. The video surgical report may include images, video, text, virtual characters, or other information describing a surgical procedure. Various portions of the surgical report may utilize different information. For example, a set of images for the surgical report may be selected from one or more images of a surgical procedure based on key surgical events. These key surgical events may be identified based on distillation information, such as surgical phases, surgical activities, and/or information related to the selected images. As such, the video surgical report may comprise a compressed representation of the one or more images of the surgical procedure. As another example, audio for the surgical report may be generated based on text descriptions describing the key surgical events and/or the selected images. The text may be created based on the identified images and/or information extracted from those images (e.g., anatomical structures determined to be depicted by the images). The text, the audio, the images, or other information may be utilized to generate video for the surgical report.
The machine learning pipeline may include one or more machine learning models, each configured to perform certain tasks associated with the video surgical report generation process. For example, the machine learning pipeline may include one or more machine learning models associated with an information distillation stage of the machine learning pipeline (e.g., a surgical phase detection model, a surgical activity detection model, etc.), one or more machine learning models associated with an information extraction stage of the machine learning pipeline (e.g., a surgical event detection model, a surgical anatomy identification model, a surgical anatomy measurement model, etc.), one or more machine learning models associated with a language generation and/or language translation stage of the machine learning pipeline (e.g., a synthetic description generation model, a synthetic description translation model), one or more machine learning models associated with an audio generation stage of the machine learning pipeline (e.g., a synthetic audio generation model), one or more machine learning models associated with a video generation stage of the machine learning pipeline (e.g., a synthetic video generation model), or other models. Some or all of the machine learning models may access a surgical feed (e.g., images and/or videos) of a surgical procedure. The images and/or videos may include those captured by a medical device, such as an endoscope. Pre-processing may be performed on the surgical feed prior to its being analyzed by the machine learning models. For example, a video of the surgical procedure captured by an image sensor of a medical device can be parsed into frames (e.g., a sequence of images forming a video). As another example, preoperative information (such as preoperative medical exam results, preoperative medical images, or other preoperative information associated with the surgical procedure, the surgeon to perform the surgical procedure, the patient on whom the surgical procedure is to be performed, etc.) may be obtained and provided to the machine learning pipeline.
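By way of illustration only, the following Python sketch shows one possible way such pre-processing could parse a surgical video into frames at a fixed sampling rate. The file name and the one-frame-per-second sampling interval are hypothetical choices used for illustration and are not requirements of the systems described herein.

```python
# Illustrative sketch only: parse a surgical video into frames using OpenCV.
# The path "surgical_feed.mp4" and the 1-frame-per-second sampling rate are
# hypothetical choices, not requirements of the pipeline described herein.
import cv2


def parse_video_into_frames(video_path: str, frames_per_second: float = 1.0):
    """Return a list of sampled frames (as numpy arrays) from the video."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / frames_per_second)), 1)

    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of the surgical feed
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames


if __name__ == "__main__":
    sampled = parse_video_into_frames("surgical_feed.mp4")
    print(f"Sampled {len(sampled)} frames from the surgical feed")
```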
The machine learning models may obtain images (e.g., video frames) of the medical procedure, such as those captured by a medical device (e.g., an endoscope). These images may be provided to some or all of the aforementioned machine learning models, at least a portion of which may generate outputs capable of being fed as input to one or more downstream models. The result of the machine learning pipeline may be a video surgical report that clearly and concisely explains the surgical procedure that was performed using images, video, text, audio, virtual characters, and/or other information. The video surgical report produced by the machine learning pipeline may minimize what, if any, input is needed from the surgeon. Thus, the complex and time-consuming process of creating a surgical report may be offloaded from the surgeon.
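As a non-limiting illustration, the stages of such a pipeline might be chained as in the following Python sketch, in which the output of each stage is fed as input to the next. Each stage function is a hypothetical placeholder standing in for the corresponding machine learning model(s), not an implementation of them.

```python
# Illustrative sketch only: chaining pipeline stages so that each stage's output
# feeds the next. Every stage function passed in is a hypothetical placeholder
# for the machine learning model(s) of the corresponding stage.
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    frames: list                                        # images of the surgical procedure
    distillation: dict = field(default_factory=dict)    # phases, activities
    extraction: dict = field(default_factory=dict)      # key events, anatomy, measurements
    description: str = ""                               # standardized description
    audio: bytes = b""                                  # synthesized narration
    report_path: str = ""                               # rendered video surgical report


def run_pipeline(frames, distill, extract, generate_text, generate_audio, generate_video):
    state = PipelineState(frames=frames)
    state.distillation = distill(state.frames)
    state.extraction = extract(state.frames, state.distillation)
    state.description = generate_text(state.extraction)
    state.audio = generate_audio(state.description)
    state.report_path = generate_video(state.frames, state.description, state.audio)
    return state
```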
It should be noted that although some aspects are described herein with respect to machine learning models, other prediction models (e.g., statistical models or other analytics models) may be used in lieu of or in addition to the machine learning models described herein. For example, a statistical model may be used in place of a machine learning model for one or more of the operations described herein.
In the following description, it is to be understood that the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It is also to be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It is further to be understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or units but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, units, and/or groups thereof.
Certain aspects of the present disclosure include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present disclosure could be embodied in software, firmware, or hardware and, when embodied in software, could be downloaded to reside on and be operated from different platforms used by a variety of operating systems. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that, throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” “generating,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
The present disclosure in some examples also relates to a device for performing the operations herein. This device may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, computer readable storage medium, such as, but not limited to, any type of disk, including floppy disks, USB flash drives, external hard drives, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability. Suitable processors include central processing units (CPUs), graphical processing units (GPUs), field-programmable gate arrays (FPGAs), and ASICs.
The methods, devices, and systems described herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein.
Medical environment 10 may include devices used to prepare for and/or perform a medical procedure to a patient 12. These devices may also be used after the medical procedure. Such devices may include one or more sensors, one or more medical devices, one or more display devices, one or more light sources, one or more computing devices, or other components. For example, at least one medical device 120 may be located within medical environment 10. Medical device 120 may be used to assist medical staff while performing a medical procedure (e.g., surgery). Medical device 120 may also be used to document events and information from the medical procedure. For example, medical device 120 may be used to input or receive patient information (e.g., to/from electronic medical records (EMRs), electronic health records (EHRs), hospital information system (HIS), communicated in real-time from another system, etc.). The received patient information may be saved onto medical device 120. Alternatively or additionally, the patient information may be displayed using medical device 120. In some aspects, medical device 120 may be used to record patient information. For example, medical device 120 may be used to store the patient information or images in an EMR, EHR, HIS, or other databases.
Medical device 120 may be capable of obtaining, measuring, detecting, and/or saving information related to patient 12. Medical device 120 may or may not be coupled to a network that includes records of patient 12, for example, an EMR, EHR, or HIS. Medical device 120 may include or be integrated with a computing system 102 (e.g., a desktop computer, a laptop computer, a tablet device, etc.) having an application server. For example, medical device 120 may include processors or other hardware components that enable data to be captured, stored, saved, and/or transmitted to other devices. Computing system 102 can have a motherboard that includes one or more processors or other similar control devices as well as one or more memory devices. The processors may control the overall operation of computing system 102 and can include hardwired circuitry, programmable circuitry that executes software, or a combination thereof. The processors may, for example, execute software stored in the memory device. The processors may include, for example, one or more general- or special-purpose programmable microprocessors and/or microcontrollers, graphics processing units (GPUs), tensor processing units (TPUs), application specific integrated circuits (ASICs), programmable logic devices (PLDs), programmable gate arrays (PGAs), or the like. The memory devices may include any combination of one or more random access memories (RAMs), read-only memories (ROMs) (which may be programmable), flash memory, and/or other similar storage devices. Patient information may be input into computing system 102 (e.g., making an operative note during the medical or surgical procedure on patient 12 in medical environment 10) and/or computing system 102 can transmit the patient information to another medical device 120 (via either a wired connection or wirelessly).
Computing system 102 can be positioned in medical environment 10 on a table (stationary or portable), a floor 104, a portable cart 106, an equipment boom, and/or shelving 103.
In some aspects, medical environment 10 may be an integrated suite used for minimally invasive surgery (MIS) or fully invasive procedures. Video components, audio components, and associated routing may be located throughout medical environment 10. For example, monitor 14 may present video and speakers 118 may output audio. The components may be located on or within the walls, ceilings, or floors of medical environment 10. For example, room cameras 146 may be mounted to walls 148 or a ceiling 150. Wires, cables, and hoses can be routed through suspensions, equipment booms, and/or interstitial space. The wires, cables, and/or hoses in medical environment 10 may be capable of connecting to mobile equipment, such as portable cart 106, C-arms, microscopes, etc., to route audio, video, and data information.
Imaging system 108 may be configured to capture images and/or video, and may route audio, video, and other data (e.g., device control data) throughout medical environment 10. Imaging system 108 and/or associated router(s) may route the information between devices within or proximate to medical environment 10. In some aspects, imaging system 108 and/or associated router(s) (not shown) may be located external to medical environment 10 (e.g., in a room outside of an operating room), such as in a closet. As an example, the closet may be located within a predefined distance of medical environment 10 (e.g., within 325 feet, or 100 meters). In some aspects, imaging system 108 and/or the associated router(s) may be located in a cabinet inside or adjacent to medical environment 10.
The captured images and/or videos may be displayed via one or more display devices. For example, images captured by imaging system 108 may be displayed using monitor 14. Imaging system 108, alone or in combination with one or more audio sensors, may also be capable of recording audio, outputting audio, or a combination thereof. In some aspects, patient information can be input into imaging system 108 and added to the images and videos recorded and/or displayed. Imaging system 108 can include internal storage (e.g., a hard drive, a solid state drive, etc.) for storing the captured images and videos. Imaging system 108 can also display any captured or saved images (e.g., from the internal hard drive). For example, imaging system 108 may cause monitor 14 to display a saved image. As another example, imaging system 108 may display a saved video using a touchscreen monitor 22. Touchscreen monitor 22 and/or monitor 14 may be coupled to imaging system 108 via a wired connection and/or wirelessly. It is contemplated that imaging system 108 could obtain or create images of patient 12 during a medical or surgical procedure from a variety of sources (e.g., from video cameras, video cassette recorders, X-ray scanners (which convert X-ray films to digital files), digital X-ray acquisition apparatus, fluoroscopes, computed tomography (CT) scanners, magnetic resonance imaging (MRI) scanners, ultrasound scanners, charge-coupled (CCD) devices, and other types of scanners (handheld or otherwise)). If coupled to a network, imaging system 108 can also communicate with a picture archiving and communication system (PACS), as is well known to those skilled in the art, to save images and videos in the PACS and to retrieve images and videos from the PACS. Imaging system 108 can couple to and/or integrate with, e.g., an electronic medical records database (e.g., EMR) and/or a media asset management database.
Touchscreen monitor 22 and/or monitor 14 may display images and videos captured live by imaging system 108. Imaging system 108 may include at least one image sensor, for example, disposed within camera head 140. Camera head 140 may be configured to capture an image or a sequence of images (e.g., video frames) of patient 12. Camera head 140 can be a hand-held device, such as an open-field camera or an endoscopic camera. For example, imaging system 108 may be coupled to an endoscope 142, which may include, or be coupled to, camera head 140. Camera head 140 may communicate with a camera control unit 144 via a fiber optic cable 147, and camera control unit 144 may communicate with imaging system 108 (e.g., via a wired or wireless connection).
Room cameras 146 may also be configured to capture an image or a sequence of images (e.g., video frames) of medical environment 10. The captured image(s) may be displayed using touchscreen monitor 22 and/or monitor 14. In addition to room cameras 146, a camera 152 may be disposed on a surgical light 154 within medical environment 10. Camera 152 may be configured to capture an image or a sequence of images of medical environment 10 and/or patient 12. Images captured by camera head 140, room cameras 146, and/or camera 152 may be routed to imaging system 108, which may then be displayed using touchscreen monitor 22, monitor 14, another display device, or a combination thereof. Additionally, the images captured by camera head 140, room cameras 146, and/or camera 152 may be provided to a database for storage (e.g., an EMR).
Room cameras 146, camera 152, and/or camera head 140 of endoscope 142 (or another camera of imaging system 108) may include at least one solid state image sensor. For example, the image sensor of room cameras 146, camera 152, and/or camera head 140 may include a charge coupled device (CCD), a complementary metal-oxide semiconductor (CMOS) sensor, a charge-injection device (CID), or another suitable sensor technology. The image sensor of room cameras 146, camera 152, and/or camera head 140 may include a single image sensor. The single image sensor may be a grayscale image sensor or a color image sensor having an RGB color filter array deposited on its pixels. The image sensor of room cameras 146, camera 152, and/or camera head 140 may alternatively include three sensors: one sensor for detecting red light, one sensor for detecting green light, and one sensor for detecting blue light.
The medical procedure in which the images may be captured using room cameras 146, camera 152, and/or camera head 140 may be an exploratory procedure, a diagnostic procedure, a study, a surgical procedure, a non-surgical procedure, an invasive procedure, or a non-invasive procedure. As mentioned above, camera head 140 may be an endoscopic camera (e.g., coupled to endoscope 142). It is to be understood that the term endoscopic (and endoscopy in general) is not intended to be limiting, and rather camera head 140 may be configured to capture medical images from various scope-based procedures including but not limited to arthroscopy, ureteroscopy, laparoscopy, colonoscopy, bronchoscopy, etc.
Speakers 118 may be positioned within medical environment 10 to provide sounds, such as music, audible information, and/or alerts, that can be played within the medical environment during the medical procedure. For example, speaker(s) 118 may be installed on ceiling 150 and/or positioned on a bookshelf, on a station, etc.
One or more microphones 16 may sample audio signals within medical environment 10. The sampled audio signals may comprise the sounds played by speakers 118, noises from equipment within medical environment 10, and/or human speech (e.g., voice commands to control one or more medical devices or verbal information conveyed for documentation purposes). Microphone(s) 16 may be located within a speaker (e.g., a smart speaker) attached to monitor 14, as shown in
Medical devices 120 may include one or more sensors 122, such as an image sensor, an audio sensor, a motion sensor, or other types of sensors. Sensors 122 may be configured to capture one or more images, one or more videos, audio, or other data relating to a medical procedure. As an example, with reference to
Client devices 130-1 to 130-N may be capable of communicating with one or more components of system 100 via a wired and/or wireless connection (e.g., network 170). Client devices 130 may interface with various components of system 100 to cause one or more actions to be performed. For example, client devices 130 may represent one or more devices used to display images and videos to a user (e.g., a surgeon). Examples of client devices 130 may include, but are not limited to, desktop computers, servers, mobile computers, smart devices, wearable devices, cloud computing platforms, display devices, mobile terminals, fixed terminals, or other client devices. Each client device 130-1 to 130-N of client devices 130 may include one or more processors, memory, communications components, display components, audio capture/output devices, imaging components, other components, and/or combinations thereof.
Computing system 102 may include one or more subsystems, such as an information distillation subsystem 110, an information extraction subsystem 112, a language generation subsystem 114, a language translation subsystem 116, an audio generation subsystem 124, a video generation subsystem 126, or other subsystems. Some or all of information distillation subsystem 110, information extraction subsystem 112, language generation subsystem 114, language translation subsystem 116, audio generation subsystem 124, and video generation subsystem 126 may be implemented using one or more processors, memory, and interfaces. Distributed computing architectures and/or cloud-based computing architectures may alternatively or additionally be used to implement some or all of the functionalities associated with information distillation subsystem 110, information extraction subsystem 112, language generation subsystem 114, language translation subsystem 116, audio generation subsystem 124, and video generation subsystem 126.
It should be noted that, while one or more operations are described herein as being performed by particular components of computing system 102, those operations may be performed by other components of computing system 102 or other components of system 100. As an example, while one or more operations are described herein as being performed by components of computing system 102, those operations may alternatively be performed by one or more of medical devices 120 and/or client devices 130.
Information distillation subsystem 110, information extraction subsystem 112, language generation subsystem 114, language translation subsystem 116, audio generation subsystem 124, and/or video generation subsystem 126 may be configured to implement various portions of a video surgical report generation process.
Video surgical report generation process 200 may include an information distillation stage 420 (shown in
Surgical video feed 202 may be provided as input to information distillation stage 420, which may output distillation information 204. Examples of distillation information 204 may include a detected surgical phase and/or one or more surgical activities detected during the surgical phase.
Distillation information 204 may be provided as input to information extraction stage 520. At information extraction stage 520, extraction information 206 may be extracted from surgical video feed 202 based on distillation information 204. Extraction information 206 may include indications of any surgical events that were detected during the surgical procedure, as well as any anatomical structures that are identified. Extraction information 206 may also include details associated with the identified anatomical structures (e.g., length, volume, blood loss, etc.).
Extraction information 206 may be provided as input to language generation stage 620, which may output a standardized description 208 describing certain phases of the surgical procedure. For example, standardized description 208 may describe key surgical events (e.g., rare and/or unexpected events described further below) identified during information extraction stage 520. Standardized description 208 may include text describing the detected key surgical events, the identified anatomical structures, details associated with the identified anatomical structures, or other information. For example, portions of the generated text may be associated with one or more images to be included in the video surgical report. Language generation stage 620 may also take, as input, preoperative information 210.
Preoperative information 210 may include results of medical exams performed on the patient prior to the surgical procedure (e.g., blood tests). Preoperative information 210 may instead or additionally include preoperative imaging performed on the patient (e.g., X-Rays, CT scans, MRIs, etc.). One or more of the images captured prior to the surgical procedure being performed may be added to a later produced video surgical report (e.g., video surgical report 216 of
Standardized description 208 may be provided as input to language translation stage 720. At language translation stage 720, standardized description 208 may be translated to obtain translated description 212. Language translation stage 720 may translate standardized description 208 from a first language to a second language. For example, standardized description 208 may include text in English, which may be translated to another language (e.g., English to French, English to Spanish, English to German, English to Japanese, etc.). Furthermore, language translation stage 720 may create multiple translated descriptions 212, each in a different language. Language translation stage 720 may be skipped or omitted if it is determined that standardized description 208 includes text in a language that does not need to be translated.
Standardized description 208 may be provided as input to audio generation stage 820. Optionally, translated description 212 may instead or additionally be provided as input to audio generation stage 820. At audio generation stage 820, the generated text (e.g., standardized description 208 and/or translated description 212) may be transformed into audio data 214 representing speech of that text. For example, a text-to-speech model may be used at audio generation stage 820 to generate audio data 214 representing the text of standardized description 208 and/or translated description 212. Audio generation stage 820 may employ an audio profile associated with a surgeon that performed the surgical procedure. For example, the audio profile may include data indicating a pitch, a tone, an accent, a timbre, a loudness (i.e., volume of speech), a modulation, and/or other characteristics of the surgeon's voice/speech.
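By way of illustration only, the following Python sketch applies a stored audio profile to off-the-shelf text-to-speech using the pyttsx3 library. The AudioProfile fields shown are assumptions introduced for this sketch; pyttsx3 exposes only rate, volume, and voice selection, so matching a particular surgeon's pitch or timbre would generally require a neural voice-cloning text-to-speech model rather than this simple engine.

```python
# Illustrative sketch only: synthesize narration from generated text using a
# stored audio profile. The AudioProfile fields are assumptions; pyttsx3 exposes
# only rate, volume, and voice, so pitch/timbre matching would need a neural TTS.
from dataclasses import dataclass
from typing import Optional

import pyttsx3


@dataclass
class AudioProfile:
    rate: int = 150               # approximate speaking rate (words per minute)
    volume: float = 0.9           # loudness in [0.0, 1.0]
    voice_id: Optional[str] = None  # identifier of a stored voice, if any


def synthesize_narration(text: str, profile: AudioProfile, out_path: str = "narration.wav"):
    engine = pyttsx3.init()
    engine.setProperty("rate", profile.rate)
    engine.setProperty("volume", profile.volume)
    if profile.voice_id:
        engine.setProperty("voice", profile.voice_id)
    engine.save_to_file(text, out_path)  # render the narration to an audio file
    engine.runAndWait()
    return out_path
```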
Audio data 214 may be provided as input to video generation stage 920 to obtain a video surgical report 216. At video generation stage 920, the extracted images, video, description, audio, and/or other content may be aggregated into video surgical report 216. The video surgical report 216 may have a shorter duration than the surgical video feed 202. Thus, the video surgical report generation process 200 may temporally compress the surgical video feed 202 with no, or minimal, loss of information. In some cases, the content aggregated into video surgical report 216 may even increase the information content conveyed relative to surgical video feed 202 while still temporally compressing it.
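As a non-limiting illustration, the following Python sketch shows one way selected still images and synthesized narration could be assembled into a short report video using the moviepy library (1.x API). The file names and the equal per-image durations are hypothetical choices made only for this sketch.

```python
# Illustrative sketch only: assemble selected images and narration into a short
# report video with moviepy (1.x API). File names and durations are hypothetical.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips


def assemble_report(image_paths, narration_path, out_path="video_surgical_report.mp4"):
    narration = AudioFileClip(narration_path)
    # Split the narration time evenly across the selected images.
    per_image = narration.duration / max(len(image_paths), 1)
    clips = [ImageClip(path).set_duration(per_image) for path in image_paths]
    report = concatenate_videoclips(clips, method="compose").set_audio(narration)
    report.write_videofile(out_path, fps=24)
    return out_path
```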
Video generation stage 920 may include steps for generating a virtual character programmed to output audio data 214. For example, the virtual character may be an avatar resembling the appearance of a surgeon (e.g., the surgeon that performed the surgical procedure). The virtual character may be generated using deep fake technology and/or other synthetic media techniques. The virtual character may be programmed to speak audio data 214. For example, the avatar's facial expressions may be generated to recreate the facial movements of the surgeon. This may allow video surgical report 216 to present a comprehensive report detailing the surgical procedure using a virtual character of a user (e.g., surgeon, medical staff, etc.) that appears to dictate information about the surgical procedure.
Returning to
Information distillation subsystem 110 may be configured to train one or more machine learning models to produce distillation information (e.g., distillation information 204 in
With reference to
Training data may be retrieved from training data database 164 (e.g., via information distillation subsystem 110) to train machine learning model 402. The training data may be selected from training data database 164 based on a type of model to be trained. For example, in the instance surgical phase detection model 302 is to be trained, the retrieved training data may include images 406a and 406b depicting various surgical phases. Images 406a and 406b may also include associated metadata. The metadata may indicate the particular surgical phases depicted by each of images 406a and 406b. Images 406a may include a first plurality of images representing a first surgical phase and images 406b may include a second plurality of images representing a second surgical phase. For example, the first plurality of images of images 406a may each depict an individual object or combination of objects, whereas the second plurality of images of images 406b may each depict a different object or combination of objects. Although only two sets of images are depicted within
Objects depicted within images 406a and 406b may indicate the surgical phases. For example, certain individual objects and/or combinations of objects may be depicted by images captured during the first phase of the surgical procedure, while different objects and/or combinations of objects may be depicted by images captured during the second phase of the surgical procedure.
As another example, in the instance surgical activity detection model 304 is to be trained, the retrieved training data may include images 406a and 406b depicting various surgical activities. These surgical activities may be performed during a detected surgical phase (e.g., a surgical phase detected by surgical phase detection model 302). Images 406a and 406b may also include associated metadata indicating the surgical activities. For example, images 406a may depict a first surgical activity performed during a first surgical phase and images 406b may depict a second surgical activity performed during the first surgical phase. Although only two sets of images are depicted within
Objects depicted by images 406a and 406b may indicate the surgical activities being performed. Sequences of images 406a and 406b may describe movements of objects (e.g., medical staff) within medical environment 10 (shown in
Machine learning model 402 may be implemented using one or more machine learning architectures. For example, machine learning model 402 may be implemented as a convolutional neural network (CNN), a long short-term memory (LSTM) model, a temporal convolutional network (TCN), one or more vision transformers, another type of machine learning model, or a combination thereof. For example, machine learning model 402 may include an EndoNet, TeCNO, OperA, and/or Trans-SVNet model, each of which may correspond to known surgical phase detection models.
During model training 410, images 406a and 406b may be provided as input to machine learning model 402. Machine learning model 402 may predict a result based on the input images and values assigned to the parameters of machine learning model 402. For example, the predicted result may include a predicted surgical phase depicted by a given image from images 406a and 406b. As another example, the predicted result may include a predicted surgical activity occurring during the surgical phase based on the given image. The predicted result may be compared to the associated metadata for images 406a and 406b. The comparison may be used to compute a loss, which may then be minimized at 404. For example, a cross-entropy loss may be used at 404. The values of the parameters of machine learning model 402 may be adjusted based on the loss. After the parameter values have been adjusted, additional images may be provided to machine learning model 402, new predictions may be made, new comparisons can be performed, and adjustments to some or all of the parameters may be made. This process may repeat a predefined number of times and/or until a threshold accuracy level is achieved from machine learning model 402 (e.g., 75% or greater accuracy, 85% or greater accuracy, 95% or greater accuracy, etc.).
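As a non-limiting illustration, the training procedure described above might resemble the following PyTorch sketch, which fine-tunes a generic image classifier with a cross-entropy loss until a target accuracy is reached. The ResNet-18 backbone, Adam optimizer, and 0.95 accuracy threshold are assumptions made for this sketch and are not details of machine learning model 402.

```python
# Illustrative sketch only: a generic supervised training loop with cross-entropy
# loss and an accuracy-threshold stopping rule. The ResNet-18 backbone, Adam
# optimizer, and 0.95 threshold are assumptions, not details of model 402.
import torch
import torch.nn as nn
from torchvision.models import resnet18


def train_phase_classifier(dataloader, num_phases, target_accuracy=0.95, max_epochs=50):
    model = resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_phases)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(max_epochs):
        correct, total = 0, 0
        for images, phase_labels in dataloader:  # labels come from image metadata
            logits = model(images)
            loss = criterion(logits, phase_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                     # adjust model parameters based on the loss
            correct += (logits.argmax(dim=1) == phase_labels).sum().item()
            total += phase_labels.size(0)
        if total and correct / total >= target_accuracy:
            break                                # threshold accuracy achieved
    return model
```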
The trained version of machine learning model 402 (trained machine learning model 412) may be used for information distillation stage 420. During information distillation stage 420, image feed 408 may be retrieved from image database 162 (e.g., via information distillation subsystem 110 illustrated in
Based on image feed 408, trained machine learning model 412 may output a surgical phase result 414, a surgical activity result 416, or other results. Surgical phase result 414 may indicate a predicted surgical phase depicted by image feed 408. Surgical phase result 414 may be a vector having n dimensions, where n equals a number of surgical phases that may occur during a surgical procedure. Surgical activity result 416 may indicate a predicted surgical activity depicted by image feed 408. Surgical activity result 416 may be based on surgical phase result 414. For example, a certain subset of surgical activities may typically occur during a first surgical phase, while another subset of surgical activities may typically occur during a second surgical phase. Surgical activity result 416 may be a vector having m dimensions, where m equals a number of surgical activities that may occur during the surgical procedure. Trained machine learning model 412 may output separate results, for example a separate surgical phase result 414 and surgical activity result 416. Alternatively, a single result may be output including each of surgical phase result 414 and surgical activity result 416. Therefore, for each input image from image feed 408, trained machine learning model 412 may determine the corresponding surgical phase result 414 and/or surgical activity result 416.
Trained machine learning model 412 may classify images from image feed 408 together based on surgical phase result 414 and/or surgical activity result 416. Trained machine learning model 412 may identify a subset of frames from image feed 408 (e.g., video of the surgical procedure) that depict objects associated with surgical phases, surgical activities, or other aspects of the surgical procedure. For example, for each image frame from image feed 408, a surgical phase may be determined by trained machine learning model 412, which may be indicated by surgical phase result 414. Image frames from image feed 408 that are determined to depict the same surgical phase (e.g., surgical phase result 414 may be the same or similar for multiple images from image feed 408) may be classified (e.g., grouped) together. As an example, image feed 408 may include a first image and a second image captured during a same phase of the surgical procedure. Trained machine learning model 412 may compute a first classification score for the first image and a second classification score for the second image. The first classification score may indicate how likely the first image depicts a particular surgical phase, and the second classification score may indicate how likely the second image depicts the same surgical phase. The first image and/or the second image may be added to a set of images which may be used for a video surgical report produced by video surgical report generation process 200 of
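By way of illustration only, the grouping and selection just described could be realized as in the following Python sketch, which groups frames by their highest-scoring surgical phase and keeps the top-scoring frame(s) within each group. Keeping a single frame per phase is a hypothetical choice for this sketch, not a requirement of the pipeline.

```python
# Illustrative sketch only: group frames by predicted surgical phase and keep the
# highest-scoring frame(s) per phase. Keeping one frame per phase is a
# hypothetical choice, not a requirement of the pipeline.
from collections import defaultdict

import numpy as np


def select_frames_by_phase(frames, phase_scores, frames_per_phase=1):
    """frames: list of images; phase_scores: array of shape (num_frames, num_phases)."""
    groups = defaultdict(list)
    for index, scores in enumerate(np.asarray(phase_scores, dtype=float)):
        phase = int(np.argmax(scores))                   # classify the frame into a phase
        groups[phase].append((float(scores[phase]), index))

    selected = []
    for phase, scored in sorted(groups.items()):
        best = sorted(scored, reverse=True)[:frames_per_phase]  # top classification scores
        selected.extend(frames[i] for _, i in best)
    return selected
```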
Returning to
Information extraction subsystem 112 may be configured to train one or more machine learning models to produce extraction information based on one or more images and distillation information 204 (shown in
As an example, with reference to
Information extraction subsystem 112 (shown in
In the instance surgical anatomy identification model 312 is to be trained, the retrieved training data may include images 506a and 506b depicting examples of various anatomical structures. For example, the first plurality of images (e.g., images 506a) may depict an individual anatomical structure and/or combinations of anatomical structures, and the second plurality of images (e.g., images 506b) may depict a different anatomical structure and/or combinations of anatomical structures. Images 506a and 506b may also include associated metadata indicating the particular anatomical structure depicted by each of images 506a and 506b, as well as any surgical events detected. For example, images 506a may depict an example of a first anatomical structure and images 506b may depict an example of a second anatomical structure.
In the instance surgical anatomy measurement model 314 is to be trained, the retrieved training data may include images 506a and 506b depicting examples of various anatomical structures, scales for measuring those anatomical structures, measurements of those anatomical structures (e.g., length, width, circumference, volume, etc.), or other information. Images 506a and 506b may also include associated metadata indicating the particular anatomical structure depicted by those images, as well as measurements or other information measured with respect to the anatomical structures. Although only two sets of images are depicted within
The training data may also include distillation information 508a and 508b, corresponding to images 506a and 506b, respectively. Distillation information 508a and 508b may include an indication of a surgical phase and/or surgical activity. For example, distillation information 508a and 508b may each include surgical phase result 414 and surgical activity result 416 (each of which are shown in
Machine learning model 502 may be implemented using one or more machine learning architectures. For example, machine learning model 502 may be implemented as a convolutional neural network (CNN), a long short-term memory (LSTM) model, a temporal convolutional network (TCN), one or more vision transformers, a generative adversarial network (GAN), one or more regressors, another type of machine learning model, and/or a combination thereof. For example, machine learning model 502 may include a You-Only-Look-Once (YOLO), U-Net, DeepLab, or LR-ASPP (based on MobileNetV3) model, each of which corresponds to a known machine learning model for information extraction.
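As a non-limiting illustration, one of the named architectures could be instantiated as in the following Python sketch, which uses torchvision's LR-ASPP (MobileNetV3) segmentation model to produce per-pixel anatomy class predictions. The number of anatomy classes and the input preprocessing are assumptions for this sketch, not details of machine learning model 502.

```python
# Illustrative sketch only: instantiate an LR-ASPP (MobileNetV3) segmentation
# model for anatomy identification. The number of anatomy classes and the input
# preprocessing are assumptions, not details of machine learning model 502.
import torch
from torchvision.models.segmentation import lraspp_mobilenet_v3_large


def build_anatomy_segmenter(num_anatomy_classes: int):
    return lraspp_mobilenet_v3_large(weights=None, num_classes=num_anatomy_classes)


def segment_anatomy(model, image_batch: torch.Tensor) -> torch.Tensor:
    """image_batch: float tensor of shape (N, 3, H, W), values normalized to [0, 1]."""
    model.eval()
    with torch.no_grad():
        logits = model(image_batch)["out"]   # per-pixel class logits, shape (N, C, H, W)
    return logits.argmax(dim=1)              # predicted anatomy class per pixel
```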
During model training 510, images 506a and 506b may be provided as input to machine learning model 502. Machine learning model 502 may predict a result based on images 506a and 506b, distillation information 508a and 508b, and values assigned to the parameters of machine learning model 502. The predicted result may include predicted extraction information (e.g., extraction information 206 shown in
The trained version of machine learning model 502 (trained machine learning model 512) may be used at information extraction stage 520. During information extraction stage 520, image feed 514 may be retrieved (e.g., by information extraction subsystem 112 illustrated in
As mentioned above, machine learning model 502 may be trained to determine whether a given image depicts a key surgical event, whether the given image depicts one or more anatomical structures, whether anatomical structure measurement information can be derived, other information, and/or combinations thereof. The trained version of machine learning model 502 may correspond to trained machine learning model 512. At information extraction stage 520, information extraction subsystem 112 (shown in
Key surgical event result 516 may indicate one or more key surgical events depicted by image feed 514. The key surgical event may correspond to one of a plurality of key surgical events associated with the surgical procedure. Key surgical event result 516 may be an n-dimensional vector, where n indicates a number of key surgical events that may occur during the surgical procedure. Each element of the vector may include a score indicating a likelihood that trained machine learning model 512 identified one of the key surgical events from a predetermined set of known surgical events.
Anatomical structure result 518 may indicate one or more anatomical structures depicted by image feed 514. Anatomical structure result 518 may be based on key surgical event result 516. Based on the key surgical events detected, different subsets of anatomical structures may be expected to be visible within image feed 514. For example, during one key surgical event, a first subset of anatomical structures may be expected to be depicted by at least a portion of image feed 514, whereas during another (different) key surgical event, a second subset of anatomical structures may be expected to be depicted by at least another portion of image feed 514. Anatomical structure result 518 may be an m-dimensional vector, where m indicates a number of anatomical structures that may be detected during the surgical procedure from a known set of anatomical structures. Each element of the vector may include a score indicating a likelihood that trained machine learning model 512 identified one of the known anatomical structures.
Anatomical structure measurement result 522 may indicate an estimated size, shape, and/or other measurements associated with the anatomical structures depicted by image feed 514. Anatomical structure measurement result 522 may be based on anatomical structure result 518. Anatomical structure measurement result 522 may be a number, for example, a length, width, height, volume, etc., associated with the anatomical structure. Different units of measurement may be used to express anatomical structure measurement result 522. For example, the measurements may be expressed in centimeters, millimeters, inches, feet, etc.
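By way of illustration only, the relationship among these three results could be expressed as in the following Python sketch, in which the detected key surgical event constrains which anatomical structures are considered and a pixel-to-size scale converts a segmentation mask into a rough measurement. The event-to-structure mapping and the millimeter-per-pixel scale are hypothetical; a real measurement would typically require calibration or depth information.

```python
# Illustrative sketch only: a detected key surgical event constrains which
# anatomical structures are considered, and a segmentation mask is converted to a
# rough size measurement. The mapping and the mm-per-pixel scale are hypothetical.
import numpy as np

# Hypothetical mapping: indices of anatomical structures expected for each event.
STRUCTURES_EXPECTED_FOR_EVENT = {0: [1, 3], 1: [0, 2, 4]}


def extract_event_structure_measurement(event_scores, structure_scores, structure_mask,
                                        mm_per_pixel=0.1):
    event_scores = np.asarray(event_scores, dtype=float)
    structure_scores = np.asarray(structure_scores, dtype=float)

    event = int(np.argmax(event_scores))                 # key surgical event result
    allowed = STRUCTURES_EXPECTED_FOR_EVENT.get(event, [])
    if allowed:
        masked = np.full(structure_scores.shape, -np.inf)
        masked[allowed] = structure_scores[allowed]      # restrict to expected structures
        structure = int(np.argmax(masked))
    else:
        structure = int(np.argmax(structure_scores))     # anatomical structure result

    pixel_count = int((np.asarray(structure_mask) == structure).sum())
    area_mm2 = pixel_count * (mm_per_pixel ** 2)         # anatomical structure measurement result
    return event, structure, area_mm2
```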
Returning to
As mentioned above with respect to
With reference to
Training data may be received from training data database 164 (e.g., via language generation subsystem 114) to train machine learning model 602. The training data may be selected from training data database 164 based on a type of model to be trained. For example, for description generation model 320, the retrieved training data may include relevant information 606 and/or pre-generated text 608. Pre-generated text 608 may represent text generated based on relevant information 606. For example, relevant information 606 may include critical information describing a previously performed surgical procedure. Relevant information 606 may include information related to some or all of the key surgical events, some or all of the anatomical structures, and/or some or all of the measurements for the anatomical structures (determined as described above). Pre-generated text 608 may include descriptions manually generated by a surgeon or other medical professional associated with the medical procedure and/or descriptions generated using one or more machine learning models (e.g., prior instances of machine learning model 602, different machine learning models, etc.).
Machine learning model 602 may be implemented using one or more machine learning architectures. For example, machine learning model 602 may be implemented as a convolutional neural network (CNN), a generative adversarial network (GAN), another type of machine learning model, or a combination thereof. An example machine learning model that may be used for machine learning model 602 includes the framework described by “LSTM vs. GRU vs. Bidirectional RNN for script generation,” to Mangal et al., 2018.
Model training 610 may provide machine learning model 602 with relevant information 606 and pre-generated text 608. Machine learning model 602 may generate a predicted description based on relevant information 606 and values assigned to hyperparameters of machine learning model 602. As an example, the predicted description may include text based on key surgical events identified, anatomical structures detected during those key surgical events, measurements of the anatomical structures, and/or other information. The predicted description may include one or more sentences written in a predetermined language (e.g., English, French, Spanish, German, Japanese, etc.). The predicted description may be compared to pre-generated text 608 to evaluate the performance of machine learning model 602 in generating contextually relevant text based on relevant information 606. The comparison may be used to compute a loss, which may then be minimized at 604. For example, a cross-entropy loss may be used at 604. The hyperparameter values may be adjusted, and additional relevant information 606 and pre-generated text 608 may be provided to machine learning model 602, whereby new predictions may be made, new comparisons may be performed, and further adjustments may be made to the hyperparameters of machine learning model 602. This process may repeat a predefined number of times and/or until a threshold accuracy level is achieved from machine learning model 602 (e.g., 75% or greater accuracy, 85% or greater accuracy, 95% or greater accuracy, and the like).
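A minimal, generic sketch of this train/compare/adjust loop is shown below, assuming a PyTorch-style setup with random tensors standing in for encoded relevant information 606 and pre-generated text 608; in this sketch, the loss-driven updates adjust the model's trainable parameters, while hyperparameter values such as the learning rate are fixed before training. This is an illustration only, not the disclosed model.

    import torch
    from torch import nn

    # Toy classifier and random data as stand-ins for machine learning model 602
    # and the (relevant information, pre-generated text) training pairs.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()          # cross-entropy loss, as at 604

    features = torch.randn(256, 16)          # stand-in for encoded relevant information
    targets = torch.randint(0, 4, (256,))    # stand-in for pre-generated text targets

    target_accuracy, max_epochs = 0.95, 500  # stop at a threshold accuracy or epoch cap
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        logits = model(features)             # predicted description (toy form)
        loss = loss_fn(logits, targets)      # compare prediction to pre-generated text
        loss.backward()
        optimizer.step()                     # adjust trainable parameters to minimize loss
        accuracy = (logits.argmax(dim=1) == targets).float().mean().item()
        if accuracy >= target_accuracy:
            break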
The trained version of machine learning model 602 (analogous to trained machine learning model 614) may be used at language generation stage 620. During language generation stage 620, one or more images 530 may be retrieved (e.g., via language generation subsystem 114 shown in
Based on the input comprising relevant information 612 and/or images 530, trained machine learning model 614 may output generated text 616. For example, standardized description 208 may include generated text 616. Generated text 616 may also include images 618. Images 618 may correspond to some or all of images 530. Trained machine learning model 614 may determine whether any of images 530 are redundant or not needed to describe a particular aspect of the surgical procedure. For example, if two or more of images 530 depict the same content, then trained machine learning model 614 may remove all but one of those images from images 530 to obtain images 618. The user-provided text data may be compared to pre-generated text data associated with relevant information 612. For example, relevant information 612 may include pre-generated text data comprising an indication of the content depicted by one or more images from surgical video feed 202 (shown in
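For illustration only, redundant images might be filtered with a simple similarity test such as the one sketched below; the cosine-similarity measure and the 0.98 threshold are assumptions for this sketch, not the disclosed technique.

    import numpy as np

    def deduplicate_images(images, threshold=0.98):
        """Keep only images that are not near-duplicates of an already kept image."""
        kept_images, kept_vectors = [], []
        for img in images:
            v = np.asarray(img, dtype=np.float32).ravel()
            v = v / (np.linalg.norm(v) + 1e-8)             # normalize for cosine similarity
            if all(float(v @ k) < threshold for k in kept_vectors):
                kept_images.append(img)
                kept_vectors.append(v)
        return kept_images

    # Example with two identical frames and one distinct frame.
    frames = [np.ones((4, 4)), np.ones((4, 4)), np.eye(4)]
    print(len(deduplicate_images(frames)))  # 2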
Returning to
As mentioned above with respect to
As an example, with reference to
Training data may be received from training data database 164 to train machine learning model 702 of
Machine learning model 702 may be implemented using one or more machine learning architectures. For example, machine learning model 702 may be implemented as a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) model, an LSTM plus Attention model, a transformer, generative adversarial network (GAN), another type of machine learning model, or a combination thereof. An example machine learning model that may be used for machine learning model 702 includes the framework described by “Deep contextualized word representations,” to Peters et al., 2018 and/or “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” to Devlin et al., 2018.
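For illustration only, the sketch below translates a generated description using an off-the-shelf pretrained translation model from the Hugging Face transformers library, used here as a stand-in for a trained version of machine learning model 702; the model name and language pair are assumptions, and the pretrained weights are assumed to be available for download.

    from transformers import MarianMTModel, MarianTokenizer

    # Pretrained English-to-French translation model used purely as a stand-in.
    model_name = "Helsinki-NLP/opus-mt-en-fr"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    description = "A small amount of bleeding was controlled near the cystic artery."
    batch = tokenizer([description], return_tensors="pt", padding=True)
    translated_ids = model.generate(**batch)
    print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))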
Model training 710 may provide machine learning model 702 with relevant information 606, original language text 706, target language text 708, and/or other information. Machine learning model 702 may generate predicted text in the target language based on original language text 706, relevant information 606, and values assigned to hyperparameters of machine learning model 702. The predicted text may include a translated description (e.g., translated description 212 shown in
The trained version of machine learning model 702 (analogous to trained machine learning model 714) may be used at language translation stage 720. During language translation stage 720, a description 712 and/or images 530 may be retrieved (e.g., via language translation subsystem 116 illustrated in
Generated text 716 may include images 718 in addition to text in the target language. Images 718 may include some or all of images 530. In an example, images 718 may include some or all of images 618 of
Returning to
As mentioned above with respect to
As an example, with reference to
Machine learning model 802 may be implemented using one or more machine learning architectures. For example, machine learning model 802 may be implemented as a convolutional neural network (CNN), a generative adversarial network (GAN), another text-to-speech (TTS) model, or a combination thereof. An example machine learning model that may be used for machine learning model 802 includes the GANSynth framework.
At model training 810, machine learning model 802 may be provided with text description 806. Text description 806 may be similar to standardized description 208 output at language generation stage 620 or translated description 212 output at language translation stage 720, as shown in
The predicted audio may be compared to target audio 808 to evaluate the performance of machine learning model 802 in generating audio for text description 806. The comparison may be used to compute a loss, which may then be minimized at 804. For example, a cross-entropy loss may be used at 804. Values of the hyperparameters of machine learning model 802 may be adjusted and additional text descriptions (e.g., text descriptions 806) may be provided as input to machine learning model 802. Predicted audio may be generated for the additional text descriptions and new comparisons may be performed, which may be used to determine adjustments to the values of the hyperparameters of machine learning model 802. This process may repeat a predefined number of times and/or until a threshold accuracy level is achieved from machine learning model 802 (e.g., 75% or greater accuracy, 85% or greater accuracy, 95% or greater accuracy, etc.).
The trained version of machine learning model 802 (analogous to trained machine learning model 812) may be used at audio generation stage 820. During audio generation stage 820, text description 814 may be retrieved (e.g., via audio generation subsystem 124) and provided to trained machine learning model 812 to produce generated audio 816. Generated audio 816 may be analogous to audio data 214 output from audio generation stage 820, as shown in
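For illustration only, an off-the-shelf text-to-speech engine can stand in for trained machine learning model 812, as sketched below; the pyttsx3 engine, speaking rate, and output file name are assumptions and are not the disclosed model.

    import pyttsx3

    # Off-the-shelf TTS engine used as a stand-in for trained machine learning model 812.
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)  # approximate speaking rate in words per minute

    text_description = "The procedure was completed without complications."
    engine.save_to_file(text_description, "generated_audio.wav")  # illustrative file name
    engine.runAndWait()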
Returning to
With reference to
As mentioned above, virtual character data 914 may be used to assist in generating the video surgical report. For example, virtual character data 914 may be used to generate a virtual character (e.g., an avatar) that will be rendered in the video surgical report. The video surgical report may include data programmed to cause the virtual character to execute life-like motions and appear as though the virtual character is speaking. For example, virtual character data 914 may include programming describing how facial expressions of the user (e.g., a surgeon or another medical professional) change when different phonemes are uttered, so as to re-create real-life facial movements. For example, the data may cause facial features of the virtual character to move while audio data 908 is output. Virtual character data 914 may include surgeon profile information associated with the surgeon or another medical professional associated with the surgical procedure. Using the surgeon profile information, the video surgical report may include an avatar of the surgeon that performed the surgery, and the surgeon avatar may appear to talk to the patient viewing the video surgical report. For example, the avatar may describe the key surgical events that transpired during the surgery, which may enable the video surgical report to communicate key elements of the patient's health more effectively. The video surgical report may be provided to the patient for subsequent review after the surgical procedure has been completed (e.g., while the patient is recovering).
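For illustration only, the sketch below shows one hypothetical way phoneme timings could be mapped to mouth-shape (viseme) keyframes that drive the virtual character while the audio plays; the phoneme set, viseme names, and timings are placeholders.

    # Hypothetical phoneme-to-viseme mapping used to animate the virtual character's
    # mouth shapes in time with the narration audio.
    PHONEME_TO_VISEME = {
        "AA": "open_jaw",
        "M": "closed_lips",
        "F": "lip_to_teeth",
        "S": "narrow_lips",
    }

    def viseme_track(phoneme_timings):
        """Convert (phoneme, start_seconds) pairs into (viseme, start_seconds) keyframes."""
        return [
            (PHONEME_TO_VISEME.get(phoneme, "neutral"), start)
            for phoneme, start in phoneme_timings
        ]

    print(viseme_track([("M", 0.00), ("AA", 0.12), ("S", 0.31)]))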
The predicted video may be compared to one or more pre-generated videos 912 to evaluate the performance of machine learning model 902 in generating video surgical reports based on the input text description, audio, virtual character data, and/or images of key surgical events. The comparison may be used to compute a loss, which may then be minimized at 904. For example, a cross-entropy loss may be used at 904. The hyperparameter values may be adjusted and additional text descriptions 906, audio data 908, virtual character data 914, and/or images 530 may be provided as input to machine learning model 902. Predicted video surgical reports may be generated for the additional text descriptions, audio data, virtual character data, and/or images, and new comparisons may be performed. The values of the hyperparameters may be adjusted further based on the comparisons. This process may repeat a predefined number of times and/or until a threshold accuracy level is achieved from machine learning model 902 (e.g., 75% or greater accuracy, 85% or greater accuracy, 95% or greater accuracy, etc.).
In an example, machine learning model 902 may be implemented with a multi-modal, multi-task, multi-embodiment agent using a single neural model across different tasks. For example, machine learning model 902 may include the model described in “A Generalist Agent,” to Reed et al., November 2022, the disclosure of which is incorporated by reference in its entirety. For example, machine learning model 902 may be trained on data from different tasks and different modalities. The data may be serialized into a flat sequence of tokens, batched, and processed. In an example, a transformer neural network may be used. Masking may also be used such that the loss function is applied only to the target outputs.
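For illustration only, the sketch below serializes tokens from multiple modalities into a single flat sequence and builds a loss mask over the target portion, in the spirit of the generalist-agent approach referenced above; the token ids and modality layout are hypothetical.

    # Sketch of flat multi-modal serialization with a loss mask over target outputs.
    def serialize_example(text_tokens, image_tokens, target_tokens):
        sequence = text_tokens + image_tokens + target_tokens
        # Loss is applied only to the target portion of the sequence.
        loss_mask = [0] * (len(text_tokens) + len(image_tokens)) + [1] * len(target_tokens)
        return sequence, loss_mask

    sequence, loss_mask = serialize_example(
        text_tokens=[101, 7592],       # tokenized text description (hypothetical ids)
        image_tokens=[501, 502, 503],  # discretized image patches (hypothetical ids)
        target_tokens=[901, 902],      # tokens of the target video representation
    )
    print(sequence, loss_mask)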
The trained version of machine learning model 902 (analogous to trained machine learning model 924) may be used at video generation stage 920. During video generation stage 920, text description 916, audio data 918 associated with text descriptions 916, and some or all of images 530 associated with text description 916 may be retrieved (e.g., via video generation subsystem 126). Additionally, video generation subsystem 126 may retrieve virtual character data 922 associated with a surgeon who performed and/or is associated with the surgical procedure for which the video surgical report is made. Video generation subsystem 126 may provide text description 916, audio data 918 associated with text descriptions 916, some or all of images 530 associated with text description 916, and virtual character data 922 to trained machine learning model 924, which may output video surgical report 926. Video surgical report 926 may be analogous to video surgical report 216 output from video generation stage 920 of video surgical report generation process 200 shown in
In some examples, trained machine learning model 924 may produce video surgical report 926 based on images 530 (which may include videos), pre-generated videos 912, and/or virtual character data 914. In such examples, generated audio 816 produced at audio generation stage 820 (shown in
Images 1004 may include some or all of images 530. For example, images 1004 may include a subset of images 530 that depict key surgical events, identified anatomical structures, and/or other aspects of the surgical procedure. Images 1004 may instead or additionally include videos and/or video snippets of key surgical events. Images 1004 may be selected for presentation within video surgical report 1000 such that they synchronize with the audio being output during video surgical report 1000. In addition to or instead of images 1004, graphics 1006 may be included within video surgical report 1000. Graphics 1006 may include pre-generated images, images extracted from a surgical video feed, and/or other content that can be presented to help explain or detail aspects of the surgical procedure. For example, graphics 1006 may include an animation associated with a key surgical event depicted by images 1004. Video surgical report 1000 may include two or more instances of images 1004, and each of these images may include graphics. For example, a pre-generated image of a particular anatomical structure may be presented within video surgical report 1000 in addition to an image extracted from the surgical video feed determined to depict the same anatomical structure.
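For illustration only, the sketch below assembles selected images and narration audio into a single video using a MoviePy 1.x-style API; the file names and per-segment durations are placeholders, and the referenced media files are assumed to exist.

    from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

    # Synchronize selected images with the narration audio (illustrative timings).
    narration = AudioFileClip("generated_audio.wav")
    segments = [("event_1.png", 4.0), ("anatomy_1.png", 3.5), ("event_2.png", 4.5)]

    clips = [ImageClip(path).set_duration(seconds) for path, seconds in segments]
    video = concatenate_videoclips(clips).set_audio(narration)
    video.write_videofile("video_surgical_report.mp4", fps=24)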
In some examples, video surgical report 1000 may be a component of a surgical report application (e.g., a mobile application, web application, etc.) that facilitates interactions with a patient. The surgical report application may communicate with a chatbot or other virtual communications system that enables a user, such as the patient, to input comments and/or questions for a medical professional and receive responses to those comments and/or questions. As an example, a user (e.g., a patient reviewing video surgical report 1000 on client device 130) may input a text query in a field of a user interface displaying video surgical report 1000. As another example, the user may speak an utterance that is detected by a speech recognition engine. The surgical report application may be configured to receive the input(s) and provide a response to the input(s). The response may be based on data associated with the surgical procedure, the patient's metadata, and/or other factors. For example, multimodal video question and answering techniques can be used by integrating a patient's electronic medical record information with data representing video surgical report 1000. Additionally, a knowledge-based video question and answering approach may use explicit data source-specific guidelines for the surgical procedure type to formulate the response to the user's input.
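For illustration only, the sketch below shows a minimal keyword-lookup stand-in for the question-answering behavior described above; a production system would use the multimodal and knowledge-based techniques mentioned, and all facts shown are placeholders.

    # Minimal keyword-retrieval stand-in for the surgical report application's
    # question-answering behavior; the facts and fallback text are illustrative only.
    REPORT_FACTS = {
        "bleeding": "Minor bleeding near the cystic artery was controlled during the procedure.",
        "recovery": "Typical recovery for this procedure is one to two weeks.",
    }

    def answer(question, facts=REPORT_FACTS):
        q = question.lower()
        matches = [text for keyword, text in facts.items() if keyword in q]
        return " ".join(matches) or "Your care team will follow up on this question."

    print(answer("Was there any bleeding during my surgery?"))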
The surgical report application may be configured to generate audio and/or video of the response, which may be provided to the user within a user interface (e.g., a user interface presented on a display screen of a desktop computer, mobile phone, tablet, etc.) presenting video surgical report 1000. The surgical report application may be configured to program virtual character 1002 such that it appears to speak the generated response to the user input. For example, the surgical report application may generate a response to a question based on the above-mentioned question and answering techniques/approaches. The generated response may be text-based, and thus speech synthesis (e.g., text-to-speech) processing techniques (e.g., such as those described above with respect to audio generation stage 820) may be used to generate audio data representing the response text (e.g., audio data 214 shown in
An option may be provided for a user to override one or more outputs of the machine learning models used in video surgical report generation process 200 of
Additional images and/or videos may also be captured after the surgical procedure, which may be used to generate an updated version of video surgical report 216. For example, post-surgical medical imaging may be performed on the patient (e.g., patient 12 of
At step 1104, a set of images from the obtained images may be determined based on the surgical procedure. The set of images may be selected using one or more machine learning models. For example, the obtained images may be provided to one or more machine learning models during information distillation stage 420. At information distillation stage 420, machine learning models, such as surgical phase detection model 302 and/or surgical activity detection model 304 shown in
At step 1106, a video surgical report may be generated for the surgical procedure, the video surgical report including the set of images. The video surgical report may be generated using one or more machine learning models. For example, machine learning models may be used to generate text describing the key surgical events, anatomical structures, etc., associated with a surgical activity occurring during a particular surgical phase. The machine learning models may instead or additionally be used to generate audio of the text, which may be produced in one or more languages. The machine learning models may also be used to generate the video surgical report based on the generated audio, the generated text, and/or other information distilled and extracted from the surgical video feed and/or preoperative information.
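For illustration only, the stages described above might be chained as sketched below, with each stub standing in for the corresponding subsystem and trained machine learning model; the function names, inputs, and return values are placeholders.

    # High-level chaining of the report generation stages; every function is a stub.
    def distill_information(images):
        return {"phase": "dissection", "activities": ["clipping"]}

    def extract_information(images, distilled):
        return {"key_events": ["bleeding"], "selected_images": images[:2]}

    def generate_description(extracted):
        return f"Key events: {', '.join(extracted['key_events'])}."

    def generate_audio(text):
        return b""  # placeholder for synthesized audio bytes

    def generate_video(text, audio, images):
        return {"text": text, "audio": audio, "images": images}

    images = ["frame_001.png", "frame_002.png", "frame_003.png"]
    distilled = distill_information(images)
    extracted = extract_information(images, distilled)
    description = generate_description(extracted)
    report = generate_video(description, generate_audio(description), extracted["selected_images"])
    print(report["text"])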
As an example, at language generation stage 620 of
As another example, at audio generation stage 820, standardized description 208 and/or translated description 212 may be provided as input to one or more machine learning models (e.g., trained machine learning model 812 of
As another example, at video generation stage 920, standardized description 208, translated description 212, extraction information 206, distillation information 204, and/or other information, may be provided as input to one or more machine learning models (e.g., trained machine learning model 924 shown in
Video surgical report 926 may be presented within a user interface of a surgical report application. The surgical report application may interface with synthetic speech, audio, and video programs to enable a user (e.g., a patient) to input questions/comments and/or receive feedback to those input questions/comments. Video surgical report 926 may be presented to patient 12 using client device 130 (e.g., shown in
Input device 1220 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, gesture recognition component of a virtual/augmented reality system, or voice-recognition device. Output device 1230 can be or include any suitable device that provides output, such as a touch screen, haptics device, virtual/augmented reality display, or speaker.
Storage 1240 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory including a RAM, cache, hard drive, removable storage disk, or other non-transitory computer readable medium. Communication device 1260 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer can be coupled in any suitable manner, such as via a physical bus or wirelessly.
Software 1250, which can be stored in storage 1240 and executed by processor 1210, can include, for example, the programming that embodies the functionality of the present disclosure (e.g., as embodied in the devices as described above). For example, software 1250 can include one or more programs for performing one or more of the steps of the methods disclosed herein.
Software 1250 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 1240, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
Software 1250 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.
Computing system 1200 may be coupled to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
Computing system 1200 can implement any operating system suitable for operating on the network. Software 1250 can be written in any suitable programming language, such as C, C++, C#, Java, or Python. In various examples, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.
As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). The words “include”, “including”, and “includes” and the like mean including, but not limited to. As used throughout this application, the singular forms “a,” “an,” and “the” include plural referents unless the content explicitly indicates otherwise. Thus, for example, reference to “an element” or “a element” includes a combination of two or more elements, notwithstanding use of other terms and phrases for one or more elements, such as “one or more.” The term “or” is, unless indicated otherwise, non-exclusive, i.e., encompassing both “and” and “or.” Terms describing conditional relationships, e.g., “in response to X, Y,” “upon X, Y,” “if X, Y,” “when X, Y,” and the like, encompass causal relationships in which the antecedent is a necessary causal condition, the antecedent is a sufficient causal condition, or the antecedent is a contributory causal condition of the consequent, e.g., “state X occurs upon condition Y obtaining” is generic to “X occurs solely upon Y” and “X occurs upon Y and Z.” Such conditional relationships are not limited to consequences that instantly follow the antecedent obtaining, as some consequences may be delayed, and in conditional statements, antecedents are connected to their consequents, e.g., the antecedent is relevant to the likelihood of the consequent occurring. Statements in which a plurality of attributes or functions are mapped to a plurality of objects (e.g., one or more processors performing steps A, B, C, and D) encompasses both all such attributes or functions being mapped to all such objects and subsets of the attributes or functions being mapped to subsets of the attributes or functions (e.g., both all processors each performing steps A-D, and a case in which processor 1 performs step A, processor 2 performs step B and part of step C, and processor 3 performs part of step C and step D), unless otherwise indicated. Further, unless otherwise indicated, statements that one value or action is “based on” another condition or value encompass both instances in which the condition or value is the sole factor and instances in which the condition or value is one factor among a plurality of factors. Unless otherwise indicated, statements that “each” instance of some collection have some property should not be read to exclude cases where some otherwise identical or similar members of a larger collection do not have the property, i.e., each does not necessarily mean each and every. Limitations as to sequence of recited steps should not be read into the claims unless explicitly specified, e.g., with explicit language like “after performing X, performing Y,” in contrast to statements that might be improperly argued to imply sequence limitations, like “performing X on items, performing Y on the X'ed items,” used for purposes of making claims more readable rather than specifying sequence. Statements referring to “at least Z of A, B, and C,” and the like (e.g., “at least Z of A, B, or C”), refer to at least Z of the listed categories (A, B, and C) and do not require at least Z units in each category. 
Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic processing/computing device. Features described with reference to geometric constructs, like “parallel,” “perpendicular/orthogonal,” “square”, “cylindrical,” and the like, should be construed as encompassing items that substantially embody the properties of the geometric construct, e.g., reference to “parallel” surfaces encompasses substantially parallel surfaces. The permitted range of deviation from Platonic ideals of these geometric constructs is to be determined with reference to ranges in the specification, and where such ranges are not stated, with reference to industry norms in the field of use, and where such ranges are not defined, with reference to industry norms in the field of manufacturing of the designated feature, and where such ranges are not defined, features substantially embodying a geometric construct should be construed to include those features within 15% of the defining attributes of that geometric construct. The terms “first”, “second”, “third,” “given” and so on, if used in the claims, are used to distinguish or otherwise identify, and not to show a sequential or numerical limitation. As is the case in ordinary usage in the field, data structures and formats described with reference to uses salient to a human need not be presented in a human-intelligible format to constitute the described data structure or format, e.g., text need not be rendered or even encoded in Unicode or ASCII to constitute text; images, maps, and data-visualizations need not be displayed or decoded to constitute images, maps, and data-visualizations, respectively; speech, music, and other audio need not be emitted through a speaker or decoded to constitute speech, music, or other audio, respectively.
The foregoing description, for the purpose of explanation, has been provided with reference to specific aspects. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The aspects were chosen and described in order to best explain the principles of the techniques and their practical applications. Others skilled in the art are thereby enabled to best utilize the techniques and various aspects with various modifications as are suited to the particular use contemplated.
Although the disclosure and examples have been fully described with reference to the accompanying figures, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of the disclosure and examples as defined by the claims. Finally, the entire disclosure of the patents and publications referred to in this application are hereby incorporated herein by reference.
This application claims the benefit of U.S. Provisional Application No. 63/387,929, filed Dec. 16, 2022, the entire contents of which are hereby incorporated by reference herein.