The present disclosure relates generally to the field of imaging systems and computer-implemented systems and methods for processing real-time video. More specifically, and without limitation, this disclosure relates to systems, methods, and computer-readable media for processing frames of real-time video and performing object detection and characterization. The systems and methods disclosed herein may be used in various applications, such as medical image analysis for polyp detection and characterization, including determining the classification, size, and location of polyps. The systems and methods disclosed herein may also be implemented to provide real-time image processing capabilities, such as identifying, based on one or more object characteristics, a medical guideline and presenting, in real-time on a display device, information for the medical guideline.
Modern vision and image analysis systems require the ability to detect and characterize objects of interest in a scene. An object of interest may be a person, place, feature, or thing. In some applications, such as systems for medical image analysis, the accuracy of object detection and characterization is important to ensure a proper diagnosis and/or treatment. Example objects of interest in medical applications include lesions, polyps and/or other abnormalities on or of human tissue.
Various object detectors and classifiers have been developed, yet many suffer drawbacks. For example, extant systems may be unable to detect variations in object types and/or may produce false positives. Some also suffer from slow response times or the inability to efficiently process real-time video signals. Still further, extant systems may provide no object characterization capabilities or only limited object information.
Therefore, there is a need for improved image analysis systems, including for medical image analysis. There is also a need for improved object detection and characterization solutions, including systems that can efficiently process real-time video and provide image analysis and object characterization information. Moreover, there is a need for computer-implemented systems and methods that can aggregate data for an object of interest and provide information depending on a given context, including related to the location, size, and/or classification of the object. Such systems and methods would be useful for applications such as polyp detection and characterization, including during an endoscopy or another medical procedure.
Consistent with some disclosed embodiments, systems, methods, and computer-readable media are provided for processing real-time video, including for processing frames of real-time video and performing object detection and characterization. Embodiments of the present disclosure also relate to systems and methods for object detection and characterization using real-time video from a medical image device. The disclosed embodiments include trained neural networks for detecting objects and determining characterizations of the identified objects, such as classification, location, and/or size. In some embodiments, the trained neural networks are arranged to operate in parallel to more efficiently determine the characterizations of each object during the medical procedure and optionally provide information related to a medical guideline. By way of example, a characterization network may be provided that includes a plurality of trained neural networks, each trained neural network being configured to detect a characterization of an identified object, such as a classification, location, or size. For each identified object, the trained neural networks of the characterization network may be applied and simultaneously operated in parallel to determine the characterizations of the object. As further disclosed herein, object detection and characterization may include polyp detection and characterization, as well as other abnormality detections and characterizations. These and other embodiments, features, and implementations are described herein.
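By way of non-limiting illustration, the parallel operation of the characterization network described above may be sketched as follows. The three functions are hypothetical stand-ins for trained neural networks, and the input fields (`texture`, `region_hint`, `pixel_width`, `mm_per_pixel`) and the class labels are assumptions made only for this sketch; a thread pool is one of several possible ways to run the networks concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for trained networks. Each maps a detected
# object (a plain dict here) to one characterization.
def classify(obj):
    return "adenoma" if obj["texture"] > 0.5 else "non-adenoma"

def locate(obj):
    return obj["region_hint"]

def estimate_size(obj):
    return round(obj["pixel_width"] * obj["mm_per_pixel"], 1)

NETWORKS = {
    "classification": classify,
    "location": locate,
    "size": estimate_size,
}

def characterize(obj):
    """Run every characterization network on one object concurrently."""
    with ThreadPoolExecutor(max_workers=len(NETWORKS)) as pool:
        futures = {name: pool.submit(fn, obj) for name, fn in NETWORKS.items()}
        # Collect each network's output once it has finished.
        return {name: future.result() for name, future in futures.items()}

detected = {"texture": 0.8, "region_hint": "sigmoid colon",
            "pixel_width": 40, "mm_per_pixel": 0.2}
result = characterize(detected)
```

In this sketch each network receives the same detected object and contributes one entry to the combined characterization, so adding a further network only requires adding an entry to `NETWORKS`.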
Consistent with the present disclosure, a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed for the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform operations or actions by virtue of including instructions that, when executed by data processing apparatus (such as one or more processors), cause the apparatus to perform such operations or actions.
One general aspect includes a computer-implemented system for processing real-time video. The computer-implemented system may include at least one processor configured to receive a real-time video captured from a medical image device during a medical procedure, where the real-time video includes a plurality of frames. The at least one processor may be further configured to detect an object of interest in the plurality of frames and apply one or more neural networks that implement: a trained classification network configured to determine a classification of the object of interest, a trained location network configured to determine a location associated with the object of interest, and a trained size network configured to determine a size associated with the object of interest. Further, the at least one processor may be configured to identify, based on one or more of the classification, the location, and the size of the object of interest, a medical guideline. The at least one processor may be further configured to present, in real-time on a display device during the medical procedure, information for the identified medical guideline. Other embodiments include corresponding computer methods, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the above operations or features.
Implementations may include one or more of the following features. The medical procedure may include at least one of an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. The object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion. By way of example, the object of interest may be a polyp. The information for the identified medical guideline may include an instruction to leave or resect the object of interest. The information for the identified medical guideline may also include a type of resection. The at least one processor may be further configured to generate a confidence value associated with the identified medical guideline. The determined classification may be based on at least one of a histological classification, a morphological classification, a structural classification, or a malignancy classification. The determined location associated with the object of interest may be a location in a human body. The location in the human body may be one of a location in a rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum. The determined size associated with the object of interest may be a numeric value or a size classification.
The at least one processor may be further configured to apply one or more neural networks that implement a trained quality network to: determine a frame quality associated with at least one of the plurality of frames, and generate a confidence value associated with the determined frame quality. The at least one processor may be further configured to: aggregate data associated with the determined classification, location, and size when at least one of the determined frame quality or the confidence value is above a predetermined threshold and present, on the display device, at least a portion of the aggregated data. Still further, the at least one processor may be configured to detect a plurality of objects of interest in the plurality of frames and determine a plurality of classifications and sizes associated with the plurality of objects of interest, where a classification and a size in the plurality of classifications and sizes are associated with a detected object of interest in the detected plurality of objects of interest. The at least one processor may be further configured to present, on the display device, information associated with one or more classifications and sizes in the plurality of classifications and sizes. Implementations of these and the other above-described operations and techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
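As one possible sketch of the quality-gated aggregation described above, per-frame characterizations may be kept only when the frame quality or its confidence value exceeds a threshold. The threshold value, the field names, the class labels, and the use of a majority vote and a median are assumptions chosen for illustration; the disclosure does not prescribe a particular aggregation rule.

```python
from collections import Counter
from statistics import median

THRESHOLD = 0.6  # hypothetical predetermined threshold

def most_common(values):
    """Return the most frequent value (majority vote)."""
    return Counter(values).most_common(1)[0][0]

def aggregate(per_frame):
    """Aggregate data from frames that pass the quality gate."""
    kept = [f for f in per_frame
            if f["quality"] > THRESHOLD or f["confidence"] > THRESHOLD]
    if not kept:
        return None  # nothing reliable enough to present
    return {
        "classification": most_common(f["classification"] for f in kept),
        "location": most_common(f["location"] for f in kept),
        "size_mm": median(f["size_mm"] for f in kept),
        "frames_used": len(kept),
    }

frames = [
    {"quality": 0.9, "confidence": 0.8, "classification": "adenoma",
     "location": "rectum", "size_mm": 6.0},
    {"quality": 0.2, "confidence": 0.3, "classification": "hyperplastic",
     "location": "rectum", "size_mm": 15.0},  # fails both tests, dropped
    {"quality": 0.7, "confidence": 0.5, "classification": "adenoma",
     "location": "rectum", "size_mm": 8.0},
    {"quality": 0.5, "confidence": 0.9, "classification": "hyperplastic",
     "location": "rectum", "size_mm": 7.0},   # kept by confidence alone
]
summary = aggregate(frames)
```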
Another general aspect includes a computer-implemented system for processing real-time video. The computer-implemented system may include at least one processor configured to receive a real-time video, where the real-time video includes a plurality of frames collected during a medical procedure. The at least one processor may be further configured to detect an object of interest in the plurality of frames and apply one or more neural networks that implement a trained characterization network configured to: determine a plurality of features associated with the object of interest, and determine confidence values associated with the plurality of features. The at least one processor may be further configured to identify, based on one or more of the plurality of features and the confidence values, a medical guideline and present, in real-time on a display device during the medical procedure, information for the identified medical guideline. Other embodiments include corresponding computer methods, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the above operations or features.
Implementations of the above computer-implemented system may include one or more of the following features. The medical procedure may include at least one of an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. The object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion. By way of example, the object of interest may be a polyp. The information for the medical guideline may include an instruction to leave or resect the object of interest. The information for the identified medical guideline may also include a type of resection. The at least one processor may be further configured to generate a confidence value associated with the identified medical guideline. The trained characterization network may include: a trained classification network configured to determine a classification associated with the object of interest and to generate a classification confidence value associated with the determined classification, a trained location network configured to determine a location associated with the object of interest and to generate a location confidence value associated with the determined location, and a trained size network configured to determine a size associated with the object of interest and to generate a size confidence value associated with the determined size. The at least one processor may be further configured to present, on the display device, information associated with at least one of the classification, the location, or the size.
The at least one processor may be further configured to: apply one or more neural networks that implement a trained quality network configured to determine a frame quality associated with at least one of the plurality of frames; and generate a confidence value associated with the determined frame quality. The at least one processor may be further configured to aggregate data associated with the plurality of features when at least one of the determined frame quality or the confidence value is above a predetermined threshold and present, on the display device, at least a portion of the aggregated data. The at least one processor may be further configured to detect a plurality of objects of interest in the plurality of frames, determine a plurality of sets of features associated with the plurality of objects of interest, where a set of features in the plurality of sets of features includes characterization and size information associated with a detected object of interest in the plurality of objects of interest, and present, on the display device, information associated with one or more sets of features in the plurality of sets of features. Implementations of these and the other above-described operations and techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
Another general aspect includes a computer-implemented system for processing real-time video. The computer-implemented system may include at least one processor configured to detect an object of interest in a plurality of frames received from a medical image device and characterize the object of interest, the characterization including determining a plurality of features associated with the object of interest. The plurality of features may include a location and a size of the object of interest. The at least one processor may be further configured to aggregate, when the object of interest persists over more than one of the plurality of frames, information associated with the determined location and size of the object of interest. The at least one processor may present, on a display device, when the determined location is in a first body region and the determined size is within a first range, the aggregated information for the object of interest and present, on the display device, when the determined location is in a second body region and the determined size is within a second range, information indicating a status of the characterization of the object of interest. Other embodiments include corresponding computer methods, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the above operations or features.
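The conditional presentation described in this aspect may be sketched, under stated assumptions, as follows. The body regions, the size ranges (in millimetres), the averaging of sizes, and the status string are hypothetical values chosen only to make the example concrete; they are not clinical guidance and are not taken from the disclosure.

```python
from collections import Counter
from statistics import mean

# Hypothetical regions and size ranges; for illustration only.
FIRST_REGION = {"rectum", "sigmoid colon"}
FIRST_RANGE = (0.0, 5.0)
SECOND_REGION = {"descending colon", "transverse colon",
                 "ascending colon", "cecum"}
SECOND_RANGE = (5.0, 50.0)

def aggregate(observations):
    """Aggregate location and size only if the object persists
    over more than one frame."""
    if len(observations) < 2:
        return None
    locations = Counter(o["location"] for o in observations)
    return {
        "location": locations.most_common(1)[0][0],
        "size_mm": mean(o["size_mm"] for o in observations),
    }

def in_range(value, bounds):
    low, high = bounds
    return low <= value < high

def choose_display(location, size_mm, aggregated, status):
    """Pick what to present on the display device."""
    if location in FIRST_REGION and in_range(size_mm, FIRST_RANGE):
        return ("aggregated", aggregated)
    if location in SECOND_REGION and in_range(size_mm, SECOND_RANGE):
        return ("status", status)
    return ("none", None)

observations = [{"location": "rectum", "size_mm": 4.0},
                {"location": "rectum", "size_mm": 4.5}]
agg = aggregate(observations)
shown = choose_display(agg["location"], agg["size_mm"], agg, "characterizing")
```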
Implementations may include one or more of the following features. The at least one processor may be further configured to identify, based on the determined location and size of the object of interest, a medical guideline and present, on the display device, information associated with the identified medical guideline. The plurality of features further may include a classification of the object of interest, the classification being based on at least one of a histological classification, a morphological classification, a structural classification, or a malignancy classification. The determined location associated with the object of interest may be a location in at least one of a rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum. The determined size associated with the object of interest may be a numeric value or a size classification.
The at least one processor may be further configured to detect a plurality of objects of interest in the plurality of frames, characterize the plurality of objects of interest, the characterization including determining a plurality of sets of features associated with the plurality of objects of interest, where a set of features in the plurality of sets of features includes characterization and size information associated with a detected object of interest in the plurality of objects of interest and present, on the display device, information associated with one or more sets of features in the plurality of sets of features. Implementations of these and other above-described operations and techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
Another general aspect includes a computer-implemented method for processing real-time video. The computer-implemented method may include receiving a real-time video captured from a medical image device during a medical procedure, where the real-time video may include a plurality of frames. The method further includes detecting an object of interest in the plurality of frames and applying one or more neural networks that implement: a trained classification network configured to determine a classification of the object of interest, a trained location network configured to determine a location associated with the object of interest, and a trained size network configured to determine a size associated with the object of interest. The method also includes identifying, based on one or more of the classification, the location, and the size, a medical guideline and presenting, in real-time on a display device during the medical procedure, information for the identified medical guideline.
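The identifying step of the method may, for example, be sketched as a lookup over a table of rules. The rule names, class labels, size cut-off, and resection technique below are placeholders invented for this sketch and carry no clinical meaning; the first-match strategy is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Characterization:
    classification: str
    location: str
    size_mm: float

@dataclass(frozen=True)
class Guideline:
    name: str
    instruction: str                 # "leave" or "resect"
    resection_type: Optional[str]    # only meaningful for "resect"
    applies: Callable[[Characterization], bool]

# Placeholder rules for illustration only; not clinical guidance.
GUIDELINES = [
    Guideline("placeholder-rule-1", "leave", None,
              lambda c: c.classification == "class-A" and c.size_mm <= 5.0),
    Guideline("placeholder-rule-2", "resect", "technique-X",
              lambda c: c.size_mm > 5.0),
]

def identify_guideline(c):
    """Return the first guideline whose predicate matches, else None."""
    for guideline in GUIDELINES:
        if guideline.applies(c):
            return guideline
    return None

def display_text(guideline):
    """Format the information to present on the display device."""
    if guideline is None:
        return "No guideline identified"
    suffix = f" ({guideline.resection_type})" if guideline.resection_type else ""
    return f"{guideline.name}: {guideline.instruction}{suffix}"

small = Characterization("class-A", "rectum", 3.0)
message = display_text(identify_guideline(small))
```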
Systems and methods consistent with the present disclosure may be implemented using any suitable combination of software, firmware, and hardware. Implementations of the present disclosure may include programs or instructions that are machine constructed and/or programmed specifically for performing functions associated with the disclosed operations or actions. Still further, non-transitory computer-readable storage media may be used that store program instructions, which are executable by at least one processor to perform the steps and/or methods described herein.
It will be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosed embodiments.
The following drawings, which comprise a part of this specification, illustrate several embodiments and, together with the description, serve to explain the principles and features of the disclosed embodiments. In the drawings:
Exemplary embodiments are described below with reference to the accompanying drawings. The figures are not necessarily drawn to scale. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It should also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
In the following description, various working examples are provided for illustrative purposes. However, it will be appreciated that the present disclosure may be practiced without one or more of these details.
Throughout this disclosure there are references to “disclosed embodiments,” which refer to examples of inventive ideas, concepts, and/or manifestations described herein. Many related and unrelated embodiments are described throughout this disclosure. The fact that some “disclosed embodiments” are described as exhibiting a feature or characteristic does not mean that other disclosed embodiments necessarily share that feature or characteristic.
This disclosure is provided for the convenience of the reader to provide a basic understanding of a few exemplary embodiments and does not wholly define the breadth of the disclosure. This disclosure is not an extensive overview of all contemplated embodiments and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its purpose is to present some features of one or more embodiments in a simplified form as a prelude to the more detailed description presented later. For convenience, the term “certain embodiments” or “exemplary embodiment” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
Embodiments described herein may refer to a non-transitory computer readable medium containing instructions that, when executed by at least one processor, cause the at least one processor to perform a method or set of operations. Non-transitory computer readable media may be any medium capable of storing data in any memory in a way that may be read by any computing device with a processor to carry out methods or any other instructions stored in the memory. The non-transitory computer readable medium may be implemented as software, firmware, hardware, or any combination thereof. Software may preferably be implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine may be implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described in this disclosure may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium may be any computer readable medium except for a transitory propagating signal.
The memory may include any mechanism for storing electronic data or instructions, including Random Access Memory (RAM), a Read-Only Memory (ROM), a hard disk, an optical disk, a magnetic medium, a flash memory, or other permanent, fixed, volatile or non-volatile memory. The memory may include one or more separate storage devices, collocated or dispersed, capable of storing data structures, instructions, or any other data. The memory may further include a memory portion containing instructions for the processor to execute. The memory may also be used as a working memory device for the processors or as a temporary storage.
Some embodiments may involve at least one processor. A processor may be any physical device or group of devices having electric circuitry that performs a logic operation on input or inputs. For example, the at least one processor may include one or more integrated circuits (ICs), including application-specific integrated circuits (ASICs), microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field-programmable gate array (FPGA), server, virtual server, or other circuits suitable for executing instructions or performing logic operations. The instructions executed by at least one processor may, for example, be pre-loaded into a memory integrated with or embedded into the controller or may be stored in a separate memory.
In some embodiments, the at least one processor may include more than one processor. Each processor may have a similar construction, or the processors may be of differing constructions that are electrically connected or disconnected from each other. For example, the processors may be separate circuits or integrated in a single circuit. When more than one processor is used, the processors may be configured to operate independently or collaboratively. The processors may be coupled electrically, magnetically, optically, acoustically, mechanically or by other means that permit them to interact.
Consistent with the present disclosure, disclosed embodiments may involve a network. A network may constitute any type of physical or wireless computer networking arrangement used to exchange data. For example, a network may be the Internet, a private data network, a virtual private network using a public network, a Wi-Fi network, a LAN or WAN network, and/or other suitable connections that may enable information exchange among various components of the system. In some embodiments, a network may include one or more physical links used to exchange data, such as Ethernet, coaxial cables, twisted pair cables, fiber optics, or any other suitable physical medium for exchanging data. A network may also include a public switched telephone network (“PSTN”) and/or a wireless cellular network. A network may be a secured network or unsecured network. In other embodiments, one or more components of the system may communicate directly through a dedicated communication network. Direct communications may use any suitable technologies, including, for example, BLUETOOTH™, BLUETOOTH LE™ (BLE), Wi-Fi, near field communications (NFC), or other suitable communication methods that provide a medium for exchanging data and/or information between separate entities.
In some embodiments, machine learning networks or algorithms may be trained using training examples, for example in the cases described below. Some non-limiting examples of such machine learning algorithms may include classification algorithms, data regression algorithms, image segmentation algorithms, visual detection algorithms (such as object detectors, face detectors, person detectors, motion detectors, edge detectors, etc.), visual recognition algorithms (such as face recognition, person recognition, object recognition, etc.), speech recognition algorithms, mathematical embedding algorithms, natural language processing algorithms, support vector machines, random forests, nearest neighbors algorithms, deep learning algorithms, artificial neural network algorithms, convolutional neural network algorithms, recursive neural network algorithms, linear machine learning models, non-linear machine learning models, ensemble algorithms, and so forth. For example, a trained machine learning network or algorithm may comprise an inference model, such as a predictive model, a classification model, a regression model, a clustering model, a segmentation model, an artificial neural network (such as a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, and so forth. In some examples, the training examples may include example inputs together with the desired outputs corresponding to the example inputs. Further, in some examples, training machine learning algorithms using the training examples may generate a trained machine learning algorithm, and the trained machine learning algorithm may be used to estimate outputs for inputs not included in the training examples. The training may be supervised or non-supervised, or a combination thereof. In some examples, engineers, scientists, processes and machines that train machine learning algorithms may further use validation examples and/or test examples.
For example, validation examples and/or test examples may include example inputs together with the desired outputs corresponding to the example inputs. A trained machine learning algorithm and/or an intermediately trained machine learning algorithm may be used to estimate outputs for the example inputs of the validation examples and/or test examples, the estimated outputs may be compared to the corresponding desired outputs, and the trained machine learning algorithm and/or the intermediately trained machine learning algorithm may be evaluated based on a result of the comparison. In some examples, a machine learning algorithm may have parameters and hyper-parameters, where the hyper-parameters are set manually by a person or automatically by a process external to the machine learning algorithm (such as a hyper-parameter search algorithm), and the parameters of the machine learning algorithm are set by the machine learning algorithm according to the training examples. In some implementations, the hyper-parameters are set according to the training examples and the validation examples, and the parameters are set according to the training examples and the selected hyper-parameters. The machine learning networks or algorithms may be further retrained based on any output.
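The division of labor between parameters and hyper-parameters may be illustrated with a deliberately small sketch. It substitutes a one-dimensional ridge regression for the neural networks of the disclosure, purely for brevity: the weight `w` is the parameter fitted from the training examples, and the regularization strength `lam` is the hyper-parameter chosen by an external grid search against the validation examples. The data values and the candidate grid are invented for illustration.

```python
def fit(train, lam):
    """Fit the parameter w of y = w * x by ridge regression (closed form)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)

def mse(w, examples):
    """Mean squared error between estimated and desired outputs."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)

def search(train, validation, grid):
    """Hyper-parameter search: pick the lam with the lowest validation error."""
    best = None
    for lam in grid:
        w = fit(train, lam)               # parameters come from training data
        error = mse(w, validation)        # evaluated on held-out examples
        if best is None or error < best[2]:
            best = (lam, w, error)
    return best

# Illustrative (input, desired output) pairs.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
validation = [(4.0, 8.0), (5.0, 10.1)]

best_lam, best_w, best_error = search(train, validation, [0.0, 1.0, 10.0])
```

Test examples, where used, would be held back from both steps and consulted only to evaluate the finally selected model.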
Certain embodiments disclosed herein may include computer-implemented systems for performing operations or methods comprising a series of steps. The computer-implemented systems and methods may be implemented by one or more computing devices, which may include one or more processors as described herein, configured to process real-time video. The computing device may be one or more computers or any other devices capable of processing data. Such computing devices may include a display such as an LED display, augmented reality (AR), or virtual reality (VR) display. However, the computing device may also be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a user device having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system and/or the computing device can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet. The computing device can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the example of
In the example of
To augment the video, computing device 160 may process the video from image device 140 and create a modified video stream to send to display device 180. The modified video may comprise the original image frames with the augmenting information to be displayed to the operator via display device 180. Display device 180 may comprise any suitable display or similar hardware for displaying the video or modified video, such as an LCD, LED, or OLED display, an augmented reality display, or a virtual reality display.
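A minimal sketch of such augmentation, assuming a frame stored as a row-major grid of RGB tuples and a hypothetical bounding box supplied by an object detector, might draw the box border onto a copy of the frame. The function name `overlay_box`, the box convention, and the color are assumptions for illustration only.

```python
import copy

def overlay_box(frame, box, color=(0, 255, 0)):
    """Return a copy of `frame` with the border of `box` drawn in `color`.

    `box` is (top, left, bottom, right) in inclusive pixel coordinates.
    """
    top, left, bottom, right = box
    out = copy.deepcopy(frame)  # leave the original frame untouched
    for x in range(left, right + 1):
        out[top][x] = color      # top edge
        out[bottom][x] = color   # bottom edge
    for y in range(top, bottom + 1):
        out[y][left] = color     # left edge
        out[y][right] = color    # right edge
    return out

BLACK = (0, 0, 0)
frame = [[BLACK] * 6 for _ in range(5)]     # 5 rows by 6 columns
augmented = overlay_box(frame, (1, 1, 3, 4))
```

Because the original frame is copied rather than modified, both the unmodified video and the modified video remain available for display.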
As shown in
As further shown in
Processor(s) 230 may also be communicatively connected via bus or network 250 to one or more I/O devices 210. I/O device 210 may include any type of input and/or output device or peripheral device. I/O device 210 may include one or more network interface cards, APIs, data ports, and/or other components for supporting connectivity with processor(s) 230 via network 250.
As further shown in
Processor(s) 230 and/or memory 240 may also include machine-readable media for storing software or sets of instructions. “Software” as used herein refers broadly to any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by one or more processors 230, may cause the processor(s) to perform the various operations and functions described in further detail herein.
Implementations of computing device 200 are not limited to the example embodiment shown in
Consistent with embodiments of the present disclosure, systems, methods, and computer-readable media are provided for processing real-time video. The systems and methods described herein may be implemented with the aid of at least one processor or non-transitory computer readable medium, such as a CPU, FPGA, ASIC, or any other processing structure(s) or storage medium of the computing device. “Real-time video,” as used herein, may refer to video received by the at least one processor, computing device, and/or system without perceptible delay from the video's source (e.g., an image device). For example, the at least one processor may be configured to receive real-time video captured from a medical image device during a medical procedure, consistent with disclosed embodiments. A medical image device may be any device capable of producing videos or one or more images of a human body or a portion thereof, such as an endoscopy device, an X-ray machine, a CT machine, or an MRI machine, as described above. A medical procedure may be any action performed with the intention of determining, detecting, measuring, or diagnosing a patient condition, such as an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. In embodiments where the medical procedure is an endoscopic procedure, the medical procedure may be used to identify objects of interest (e.g., lesions or polyps) in a location in the human body. Locations in the human body may be the rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum. It is to be understood, however, that the disclosed systems and methods may be employed in other contexts and applications.
The real-time video may comprise a plurality of frames, consistent with disclosed embodiments. A “frame,” as used herein, may refer to any digital representation such as a collection of pixels representing a scene or field of view in the real-time video. In such embodiments, a pixel may represent a discrete element characterized by a value or intensity in a color space (e.g., based on the RGB, RYB, CMY, CMYK, or YUV color models). A frame may be encoded in any appropriate format, such as Joint Photographic Experts Group (JPEG) format, Graphics Interchange Format (GIF), bitmap format, Scalable Vector Graphics (SVG) format, Encapsulated PostScript (EPS) format, or the like. The term “video” may refer to any digital representation of a scene or area of interest comprised of a plurality of frames in sequence. A video may be encoded in any appropriate format, such as a Moving Picture Experts Group (MPEG) format, a flash video format, an Audio Video Interleave (AVI) format, or any other format. A video, however, need not be encoded, and may more generally include a plurality of frames. The frames may be in any order, including a random order. In some embodiments, a video or plurality of frames may be paired with audio.
The plurality of frames may include representations of an object of interest. An “object of interest,” as used herein, may refer to any visual item or feature in the plurality of frames the detection or characterization of which may be desired. For example, an object of interest may be a person, place, entity, feature, area, or any other distinguishable visual item or thing. In embodiments where the plurality of frames comprise images captured from a medical imaging device, for example, an object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion. Examples of objects of interest in a video captured by an image device may include a polyp (a growth protruding from a gastro-intestinal mucosa), a tumor (a swelling of a part of the body), a bruise (a change from healthy cells to discolored cells), a depression (an absence of human tissue), or an ulcer or abscess (tissue that has suffered damage, i.e., a lesion). Other examples of objects of interest will be apparent from this disclosure.
Although some embodiments are described herein with reference to an object of interest being a polyp, it is to be understood that the disclosed systems and methods are not limited to polyps, but may rather be utilized in other contexts and applications including non-medical applications. A “polyp,” as used herein, may refer to growths or lesions of the gastro-intestinal mucosa, and may more generally be used herein to refer to a candidate tissue the detection and characterization of which may be of interest. Polyps may be characterized based on classification, such as based on a histological classification, a morphological classification, a structural classification, or a malignancy classification. For example, polyps may be histologically classified using the Narrow-Band Imaging International Colorectal Endoscopic (NICE) or the Vienna classification. According to the NICE classification system, a polyp may be one of three (3) types, as follows: (Type 1) sessile serrated polyp or hyperplastic polyp; (Type 2) conventional adenoma; and (Type 3) cancer with deep submucosal invasion. According to the Vienna classification, a polyp may be one of five (5) types, as follows: (Category 1) negative for neoplasia/dysplasia; (Category 2) indefinite for neoplasia/dysplasia; (Category 3) non-invasive low grade neoplasia (low grade adenoma/dysplasia); (Category 4) mucosal high grade neoplasia, such as high grade adenoma/dysplasia, non-invasive carcinoma (carcinoma in-situ), or suspicion of invasive carcinoma; and (Category 5) invasive neoplasia, intramucosal carcinoma, submucosal carcinoma, or the like.
A polyp may be morphologically classified using the Paris classification. According to the Paris classification system, a polyp may be one of three (3) general types, as follows: (Type I) elevated or polypoid forms, such as pedunculated, sessile, and broad-based; (Type II) flat or superficial forms, such as flat and elevated, completely flat, and superficially depressed; and (Type III) excavated forms, including excavated and ulcerated. Type I are often referred to as polypoid forms while Types II and III are often referred to as non-polypoid forms. A polyp may also be structurally classified based on its shape or appearance. For example, a polyp may be classified as benign if its surface is smooth or round in appearance, or as non-benign or malignant if its surface includes abnormal growths or is irregular in appearance. A polyp may also be classified based on malignancy. A malignancy may be based on a degree of invasion of a disease, such as cancer invasiveness. A polyp may, for example, be classified as benign when there is a small or no invasion of cancer in or around the polyp, and the polyp may be classified as malignant or cancerous when there is invasion of cancer in or around the polyp. Other classifications will be apparent based on this disclosure and may be selected according to the particular application. Accordingly, the present disclosure is not limited to any particular classification or type of object of interest.
A polyp may also be characterized based on its size. A polyp size may be expressed as a numeric value or a size classification. The size of a polyp may be, for example, expressed using any suitable metric value such as millimeters (mm) although any other metric may be used (e.g., inches). A polyp may thus have a size of 1 mm, 5 mm, 10 mm, and so forth. A polyp size may also be expressed as a classification based on one or more suitable size categories, such as (1) “diminutive” or “small” for polyps having a size less than or equal to 5 mm, (2) “non-diminutive” or “large” for polyps having a size between 6 mm and 9 mm, and (3) “very large” for polyps having a size equal to or greater than 10 mm. As will be appreciated, other values, categories, or labels may be used. Other size representations may be employed depending on the particular application and object of interest.
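By way of illustration only, the example size categories above may be sketched in Python as follows. The function name `classify_polyp_size` is hypothetical and not part of the disclosure. Because the example categories do not assign sizes strictly between 5 mm and 6 mm, the sketch assumes that any size above 5 mm and below 10 mm falls in the middle category; other boundary conventions may be used.

```python
def classify_polyp_size(size_mm: float) -> str:
    """Map a polyp size in millimeters to one of the example size categories."""
    if size_mm <= 0:
        raise ValueError("size_mm must be positive")
    if size_mm <= 5:
        return "diminutive"
    # Assumption: sizes above 5 mm and below 10 mm share the middle category.
    if size_mm < 10:
        return "non-diminutive"
    return "very large"


for size in (1, 5, 7, 10):
    print(size, "mm ->", classify_polyp_size(size))
```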
The at least one processor of computing device 160 (
For example, the presence of the polyps may be detected using one or more machine learning detection networks or algorithms, conventional detection algorithms, or a combination of both, consistent with the disclosed embodiments. The detection of the polyps may include a determined location of the polyp in the frame and be indicated using any suitable graphical representation, which may be overlaid over a frame in which it is detected. In
The at least one processor of computing device 160 may be configured to apply one or more neural networks that implement a trained characterization network configured to determine a plurality of features associated with the object of interest from the plurality of frames, consistent with the disclosed embodiments. The trained characterization network may comprise one or more suitable machine learning networks or algorithms for determining a plurality of features associated with the object of interest, including one or more neural networks (e.g., a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, or any other suitable model as described above trained to determine features of the object of interest in a plurality of objects of interest. The characterization network may be trained using a plurality of training frames or portions thereof labeled based on the desired features (e.g., classification, size, or location). For example, a first set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as having a feature, and a second set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as not having the feature. Weights or other parameters of the characterization network may be adjusted based on its output with respect to a third, non-labeled set of training frames (or portions of frames) until a convergence or other metric is achieved, and the process may be repeated with additional training frames (or portions thereof) or with live video or frame data, as further described herein.
In some embodiments, the characterization network of computing device 160 may be implemented with a plurality of trained neural networks wherein each trained neural network is adapted to determine a specific characterization or feature of the identified object from the real-time video or frames. For example, the plurality of trained neural networks may include at least one trained neural network to determine the classification of the object, at least one trained neural network to determine the size of the object, and at least one trained neural network to determine the location of the object. As part of computing device 160, each of the trained neural networks may be arranged to operate contemporaneously or in parallel with one another to more efficiently determine characteristics and other information of the object based on the real-time video or frames from image device 140. For example, for each identified object, the trained neural networks of the characterization network may be applied and simultaneously operated in parallel to one another to more efficiently determine the characterizations of the object. Optimizing the arrangement of the trained neural networks also enables computing device 160 to generate the augmented video display, including all determined information related to the detected object, with little or no perceived delay or latency by the clinician or operator viewing the output on display device 180 while performing the medical procedure with image device 140. In some embodiments, the trained neural networks of computing device 160 are also implemented to determine other information, such as a confidence value or medical guideline, as further disclosed herein.
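As a non-limiting sketch of the parallel arrangement described above, the following Python example submits one frame to three characterization functions concurrently using the standard-library `ThreadPoolExecutor`. The functions `classify`, `estimate_size`, and `locate` are hypothetical stand-ins that return fixed values; in practice each would wrap a trained neural network, and the choice of threads is only an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for trained networks. Each takes a frame and
# returns a (value, confidence) pair; the fixed values are placeholders.
def classify(frame):
    return ("adenoma", 0.92)


def estimate_size(frame):
    return (4.0, 0.81)


def locate(frame):
    return ("ascending colon", 0.77)


NETWORKS = {"classification": classify, "size": estimate_size, "location": locate}


def characterize(frame, executor):
    """Run every network on the same frame concurrently and collect the results."""
    futures = {name: executor.submit(fn, frame) for name, fn in NETWORKS.items()}
    return {name: future.result() for name, future in futures.items()}


with ThreadPoolExecutor(max_workers=len(NETWORKS)) as pool:
    result = characterize(frame=None, executor=pool)

print(result)
```

The time to characterize an object in this arrangement is bounded by the slowest network rather than by the sum of all networks, which is the efficiency gain the parallel arrangement is intended to provide.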
In some embodiments, the trained characterization network of computing device 160 may be configured to determine confidence values associated with the plurality of features. A confidence value for an identified feature may refer to an indication of the level of certainty associated with the identified feature. For example, a confidence value of 0.9 or 90% may indicate that there is a ninety percent certainty that the identified feature is present in an object of interest, while a confidence value of 0.4 or 40% may indicate that there is a forty percent certainty that the identified feature is present, and so forth. Other values, metrics, or representations may be used to represent a confidence value, however, such as alphabetical characters (e.g., “A” for a high confidence value and an “F” for a low confidence value), colors (e.g., green for a high confidence value and red for a low confidence value), shapes (e.g., a check for a high confidence value and a cross for a low confidence value), or any other suitable representation.
The at least one processor of computing device 160 may be configured to apply one or more neural networks to determine a confidence value. In some embodiments, the confidence value associated with an output may be implicitly defined in a one-hot encoding formulation, where for each possible class a score is output by the network. During training, neural network calibration methods such as mixup and label smoothing may be used to control the range and distribution of confidence values or scores. Alternatively, in other embodiments, a dedicated output node is added to each neural network that provides a confidence estimation and an abstention term can be included in the loss function to train the neural network accordingly. With the extra term in the loss function, the neural network can predict low confidence values when the estimation error is high due to, for example, low quality or cluttered images. In still other embodiments, the neural network is trained to predict both the output and confidence score or label.
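For illustration, the implicit per-class confidence and the label smoothing mentioned above may be sketched in plain Python as follows. The function names, the smoothing factor `epsilon`, and the example scores are assumptions made for this sketch; the smoothing formula shown is the commonly used one, in which a fraction `epsilon` of the target mass is spread uniformly over all classes.

```python
import math


def softmax(logits):
    """Convert raw per-class network outputs into scores that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_one_hot(class_index, num_classes, epsilon=0.1):
    """Label-smoothed training target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if i == class_index else off_value
        for i in range(num_classes)
    ]


def predict_with_confidence(logits, labels):
    """Return the top-scoring label and its score, used as the confidence value."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


labels = ["adenoma", "hyperplastic", "sessile serrated"]
label, confidence = predict_with_confidence([2.0, 0.5, 0.1], labels)
print(label, round(confidence, 3))
print(smooth_one_hot(0, len(labels)))
```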
In some embodiments, each neural network is trained to obtain in a validation step a correspondence between a threshold value of a confidence score output by the network and the performance of a specific task, such as characterizing an identified object in an input image frame. In the validation stage, an independent and labeled dataset may be used to obtain a correspondence between the thresholds on confidence scores from the neural networks and each specific characterization or feature of the object. Computing device 160 may determine the confidence score threshold for achieving an expected performance level of a neural network performing a specific task of characterizing an identified object. For example, a trained neural network may be found to attain a sensitivity of 99% on a particular task when the confidence score threshold is set to 0.6. Computing device 160 may then execute a neural network to select image frames with an expected performance level of a specific task based on those that satisfy the determined confidence score threshold. Further, computing device 160 may use the desired performance level to implement medical guidelines when using frames from image device 140. One or more metrics may be used for measurement, such as accuracy, precision, recall, sensitivity, negative predictive value (NPV), specificity, etc. This approach permits the correlation to be obtained between a given threshold on the confidence score output by the neural networks and the expected accuracy of the determined characterization or feature for an identified object.
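The validation step described above may be sketched as follows, under the assumption that the validation set is a list of (confidence, predicted positive, actually positive) records and that sensitivity is the metric of interest. The function names and the toy records are hypothetical and serve only to show how a confidence score threshold can be matched to a target performance level.

```python
def sensitivity(records):
    """records: iterable of (predicted_positive, actually_positive) pairs."""
    records = list(records)
    true_pos = sum(1 for pred, actual in records if pred and actual)
    false_neg = sum(1 for pred, actual in records if not pred and actual)
    if true_pos + false_neg == 0:
        return None  # no positive cases remain, so sensitivity is undefined
    return true_pos / (true_pos + false_neg)


def lowest_threshold_meeting(validation, target, candidates):
    """Return the lowest candidate threshold whose retained predictions meet the target."""
    for threshold in sorted(candidates):
        kept = [(pred, actual) for conf, pred, actual in validation if conf >= threshold]
        value = sensitivity(kept)
        if value is not None and value >= target:
            return threshold
    return None


# Toy labeled validation records: (confidence, predicted_positive, actually_positive).
validation = [
    (0.95, True, True),
    (0.90, True, True),
    (0.70, True, True),
    (0.65, False, False),
    (0.55, False, True),  # a missed positive at low confidence
    (0.40, True, False),
]

print(lowest_threshold_meeting(validation, target=0.99, candidates=[0.4, 0.5, 0.6, 0.7]))
```

Other metrics (e.g., accuracy or specificity) may be substituted for `sensitivity` in the same loop.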
In some embodiments, the at least one processor of computing device 160 may be configured to apply one or more specific machine learning networks or algorithms trained to detect one or more specific features. For example, the at least one processor may be configured to apply one or more neural networks that implement a trained classification network configured to determine a classification of the object of interest. In some embodiments, the trained classification network may also be configured to generate a classification confidence value associated with the determined classification. The classification may be the same or similar as those described above (e.g., based on at least one of a histological classification, a morphological classification, a structural classification, or a malignancy classification). The classification network may be trained using a plurality of training frames or portions thereof labeled based on one or more classifications. For example, a first set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “adenoma”, and a second set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “non-adenoma” or another classification (e.g., “serrated”). Other labeling conventions could be used both in binary (e.g., “hyperplastic” vs “non-hyperplastic”) and in multiple classes (e.g., “adenoma” vs “sessile serrated” vs “hyperplastic”). Weights or other parameters of the classification network may be adjusted based on its output with respect to a third, non-labeled set of training frames (or portions of frames) until a convergence or other metric is achieved, and the process may be repeated with additional training frames (or portions thereof) or with live data as described herein.
Also, the at least one processor of computing device 160 may be configured to apply one or more neural networks that implement a trained location network configured to determine a location associated with the object of interest. The trained location network may also be configured to generate a location confidence value associated with the determined location. The location may be the same or similar as those described above (e.g., a location in a human body, such as the rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum). The location network may be trained using a plurality of training frames or portions thereof labeled based on one or more locations (e.g., body locations). For example, a first set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “sigma rectum”, and a second set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “not sigma rectum” or another body location (e.g., “ascending colon”). Weights or other parameters of the location network may be adjusted based on its output with respect to a third, non-labeled set of training frames (or portions of frames) until a convergence or other metric is achieved, and the process may be repeated with additional training frames (or portions thereof) or with live data as described herein.
Furthermore, the at least one processor of computing device 160 may be configured to apply one or more neural networks that implement a trained size network configured to determine a size associated with the object of interest, and in some embodiments the trained size network may also be configured to generate a size confidence value associated with the determined size. The size may be the same or similar as those described above (e.g., a numeric value or a size classification). The size network may be trained using a plurality of training frames or portions thereof labeled based on size. For example, a first set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “diminutive” or “small,” and a second set of training frames (or portions of frames) containing or not containing an object of interest may be labeled as “non-diminutive” or “not small” or another size value (e.g., “large” or “10 mm”). Weights or other parameters of the size network may be adjusted based on its output with respect to a third, non-labeled set of training frames (or portions of frames) until a convergence or other metric is achieved, and the process may be repeated with additional training frames (or portions thereof) or with live data as described herein.
Consistent with the description above, the trained classification, location, and/or size networks may be stored in the computing device and/or system, or they may be fetched from a network or database prior to characterization. In some embodiments, the trained classification, location, and/or size networks may be re-trained based on one or more outputs, such as true or false feature detections. The feedback for re-training may be generated automatically by the system or the computing device, or it may be manually inputted by the operator or another user (e.g., through a mouse or keyboard or other input device). Weights or other parameters of the trained classification, location, and/or size networks may be adjusted based on the feedback. In some embodiments, conventional non-machine learning detection algorithms may be used, either alone or in combination with the trained classification, location, and/or size networks.
The at least one processor of computing device 160 may be configured to identify, based on one or more of the plurality of features and/or the confidence values, a medical guideline. A “medical guideline,” as used herein, may refer to any information provided with the aim of aiding in the determination, diagnosis, or treatment of a patient condition. For example, in embodiments where an object of interest is a polyp or other formation in or on human tissue, the identified medical guideline may include an instruction to leave or resect the object of interest. In some embodiments, when the medical guideline includes an instruction to resect the object of interest, the medical guideline may also include an identification or description of a specific type of resection. For example, the medical guideline may include an instruction to perform Endoscopic Mucosal Resection (EMR) to resect small polyps or precancerous growths, or to perform Endoscopic Submucosal Dissection (ESD) to resect large polyps or likely-cancerous growths, or any other type of resection. Other medical guidelines may be identified based on the specific application or context, however, such as examining the object of interest further, performing other medical examinations, performing an operation, prescribing a drug or medicine, or administering other treatments. The at least one processor may also be configured to present, in real-time on a display device during capture (e.g., during a medical procedure), information for the identified medical guideline, as described above.
The displayed information for the identified medical guideline may be in any suitable representation, such as one or more alphanumeric characters (e.g., the words “leave” or “resect”) or abbreviations or text indicating more specifically the type of resection (e.g., “EMR” or “ESD”), shapes (e.g., a check mark or a cross sign), colors (e.g., green or red), images (e.g., an image of a hand or a medical instrument), videos (e.g., a video of a suggested procedure), or any combination thereof. As a further example, in an augmented video display, information related to a medical guideline may be displayed separately and/or in proximity to the object of interest (see, e.g.,
The at least one processor may be configured to generate a confidence value associated with the identified medical guideline. A confidence value for an identified medical guideline may refer to an indication of the level of certainty associated with the identified medical guideline. For example, a confidence value of 0.9 or 90% may indicate that there is a ninety percent certainty that the identified medical guideline is correct, while a confidence value of 0.4 or 40% may indicate that there is a forty percent certainty that the identified medical guideline is correct, and so forth. Other values, metrics, or representations may be used to represent a confidence value, however. For example, a confidence value may be categorized as “high confidence” when the confidence value is above a predetermined threshold (e.g., above 66% confidence), “low confidence” when the confidence value is below a predetermined threshold (e.g., below 33% confidence), or “undetermined” when between two predetermined thresholds (e.g., 33% to 66% confidence). In some embodiments, the at least one processor may also present a confidence score associated with the medical guideline. For example, a confidence score may be represented as alphanumeric characters along with the medical guideline, such as “leave—high confidence” or “resect—30% confidence,” although any other suitable representation may be used (e.g., a shape, a color, an image, a video, or a combination thereof).
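The example thresholds above (33% and 66%) may be applied as in the following Python sketch. The function names are hypothetical, the thresholds are exposed as parameters because other values may be used, and a plain hyphen is used as the separator in the displayed text purely as a formatting choice for this sketch.

```python
def confidence_category(confidence, low=0.33, high=0.66):
    """Map a confidence value in [0, 1] to one of three example categories."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > high:
        return "high confidence"
    if confidence < low:
        return "low confidence"
    return "undetermined"


def guideline_label(guideline, confidence, as_percent=False):
    """Build the text shown alongside a guideline such as 'leave' or 'resect'."""
    if as_percent:
        return f"{guideline} - {round(confidence * 100)}% confidence"
    return f"{guideline} - {confidence_category(confidence)}"


print(guideline_label("leave", 0.9))
print(guideline_label("resect", 0.3, as_percent=True))
```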
In some embodiments, the at least one processor of computing device 160 may be configured to apply one or more neural networks to determine a confidence value associated with a medical guideline. For example, as part of or after training the neural networks, a validation stage may be used to associate a confidence value and a medical guideline. In this stage, an independent and labeled dataset may be used to obtain a correspondence between the thresholds on confidence scores from the neural networks and the performance on the task of interest measured according to multiple metrics such as accuracy, precision, recall, sensitivity, negative predictive value (NPV), specificity, etc. This approach permits the correlation to be obtained between a given threshold on the confidence score output by the neural networks and the expected performance in a real case scenario.
In embodiments where the characterization network determines an object of interest's classification, body location, and/or size, the at least one processor may be configured to identify, based on one or more of the classification, the location, and the size, the medical guideline. For example, in embodiments where the object of interest is a polyp, the medical guideline may be to leave the polyp when it is classified as hyperplastic and to resect the polyp when it is classified as dysplastic or neoplastic. Likewise, the medical guideline may be to leave the polyp when it is determined to be equal to or less than 5 mm or “diminutive” in size and to resect the polyp when it is determined to be greater than 5 mm or “small” or “large” in size. Similarly, the medical guideline may be to leave the polyp when it is determined to be in a non-harmful body location and to resect the polyp when it is determined to be in a harmful body location. In some embodiments, the medical guideline may be determined based on a combination of the classification, the location, and the size. For example, the medical guideline may be to leave the polyp when it is determined to be a hyperplastic polyp located in the sigma rectum having a size equal to or less than 5 mm. As another example, the medical guideline may be to leave the polyp when it is determined to be a hyperplastic polyp located in the rectum having a “diminutive” size. On the other hand, the medical guideline may be to resect the polyp when it is determined to be an adenoma located in the cecum having a “diminutive” size. Likewise, the medical guideline may be to resect the polyp when it is determined to be an adenoma located in the ascending colon having a “non-diminutive” size. In some embodiments, a confidence value associated with the relevant characterization may be used to determine the medical guideline.
For example, the medical guideline may be to resect the polyp only when there is a ninety percent confidence value that the polyp is neoplastic, 10 mm or “large” in size, and/or in a harmful body location, and to leave the polyp otherwise. Confidence values could be expressed also as “high confidence” or “low confidence”. Other confidence values and characterizations may be used, however, as described above. The medical guideline determinations disclosed herein are provided for illustrative purposes only and are not intended to be exhaustive, and other medical guidelines will be apparent to those having ordinary skill in the art.
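The confidence-gated example above (resect only when a feature indicating resection is determined with at least ninety percent confidence, and leave otherwise) may be sketched as follows. The `Feature` type, the function name `identify_guideline`, and the `harmful_locations` parameter are hypothetical; the disclosure does not enumerate which body locations are harmful, so the caller supplies that set, and the sketch is illustrative only and not clinical guidance.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """A determined feature of an object together with its confidence value."""
    value: object
    confidence: float


def identify_guideline(classification, size_mm, location,
                       harmful_locations, min_confidence=0.9):
    """Return 'resect' if any resect-indicating feature is confident enough, else 'leave'."""

    def confident(feature, indicates_resection):
        return feature.confidence >= min_confidence and indicates_resection(feature.value)

    if (
        confident(classification, lambda v: v == "neoplastic")
        or confident(size_mm, lambda v: v >= 10)
        or confident(location, lambda v: v in harmful_locations)
    ):
        return "resect"
    return "leave"


# Example call with placeholder values; no location is treated as harmful here.
print(identify_guideline(
    Feature("neoplastic", 0.95), Feature(4, 0.80), Feature("rectum", 0.90),
    harmful_locations=set(),
))
```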
Object detector 420 may comprise one or more machine learning detection networks or algorithms, conventional detection algorithms, or a combination of both, as described above with respect to the embodiment of
Characterization network 430 may include one or more trained machine learning algorithms (e.g., one or more neural networks) configured to determine a plurality of features for each object of interest detected by object detector 420, as described above with respect to the embodiment of
In some embodiments, characterization network 430 may be implemented with a plurality of trained neural networks (e.g., networks 440, 450, and 460 and/or networks for determining confidence values and medical guidelines) that are arranged to operate in parallel with one another to more efficiently determine characteristics and other information related to the identified object from the real-time video or frames. For example, the plurality of trained neural networks may include at least one trained neural network (i.e., classification network 440) to determine the classification of the object, at least one trained neural network (i.e., location network 450) to determine the location of the object, and at least one trained neural network (i.e., size network 460) to determine the size of the object, as well as trained neural networks for determining confidence values and/or medical guidelines. Optimizing the arrangement of trained neural networks (e.g., networks 440, 450, and 460 as well as networks for determining confidence values and medical guidelines) to operate simultaneously in parallel with one another enables all information related to each identified object to be determined efficiently and enables characterization network 430 to generate the augmented video display, including the determined information related to the detected object, with little or no perceived delay by the clinician or operator viewing the augmented video display on display device 470 while performing a medical procedure with image device 410.
In some embodiments, classification network 440, location network 450, and size network 460 may be implemented to provide multiple output values at the same time for each identified object. For example, classification network 440 could be configured to provide as output an optical characterization prediction (e.g., adenoma, hyperplastic, SSL, etc.) as well as a morphology estimation (e.g., sessile, pedunculated, etc.) and a pit-pattern description (Type I, Type II, Type III, . . . ). This may be achieved through a branching in the final layers of the neural network of classification network 440, wherein each branch infers a specific classification. Additionally, or alternatively, multiple instances of networks 440-460 may be instantiated to operate simultaneously in parallel to determine multiple characterizations or features with respect to one or a plurality of detected objects.
For each identified object, classification network 440, location network 450, and size network 460 may be implemented to process the entire frame and/or an image patch around the object of interest. The output of each network 440, 450, and 460 will be one or more predictions for a given class or characteristic or an estimated value (regression) and may include a confidence score, as disclosed above with respect to the
In the example of
Real-time processing system 400 may receive frames of a video from image device 410, process them, and provide in real-time (i.e., simultaneously or approximately at the same time that a physician or operator is performing the medical procedure) an augmented video display with relevant information to an operator of image device 410 for objects of interest identified in the frames. As disclosed herein, characterization network 430 of real-time processing system 400 may be optimized by arranging the trained neural networks (e.g., networks 440, 450, and 460 and/or networks for determining confidence values and medical guidelines) to process frames in parallel (or substantially at the same time) and provide their output more efficiently. With such an arrangement, the augmented video display with all determined information may be presented with little or no perceived delay by the clinician or operator performing the endoscopy or other medical procedure.
In some embodiments, object detector 420 and/or characterization network 430 may run multiple instances of machine learning detection networks in parallel against the plurality of frames of the real-time video from image device 410. Additionally, or alternatively, the frames may be buffered for processing by the trained neural networks. For example, real-time processing system 400 may buffer frames from image device 410 and provide them as input to the neural networks. At each iteration, networks 440-460 may take as input N image frames buffered by system 400. In some embodiments, networks 440-460 process the current frame along with the past N−1 image frames with the output of each network for the current frame also being dependent on the past N−1 frames. With this buffering implementation, real-time processing can be provided with one of three options: (i) there is no output for the first N−1 frames or (ii) the output for the first N−1 frames only depends on the current frame (i.e., there is no buffering in the initial phase) or (iii) the output for the first N−1 frames only depends on the last frame and all the previous ones available. Additionally, there could be other intervals during which one or more of the networks 440-460 are not providing an output. During intervals where there is no output, real-time processing system 400 may communicate the status of the system by causing appropriate messages to be displayed (e.g., status messages such as “processing,” “buffering,” or “analyzing”) via display device 470.
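The buffering scheme and its three start-up options may be sketched with a fixed-length queue, as in the following Python example. The class name `FrameBuffer` and the `warmup` mode names are hypothetical labels introduced here for options (i), (ii), and (iii); integers stand in for image frames.

```python
from collections import deque


class FrameBuffer:
    """Sliding window over the most recent N frames.

    warmup selects the behavior before N frames have arrived:
      "none"      -> option (i): no output
      "current"   -> option (ii): use only the current frame
      "available" -> option (iii): use all frames received so far
    """

    def __init__(self, n, warmup="none"):
        if n < 1:
            raise ValueError("n must be at least 1")
        if warmup not in ("none", "current", "available"):
            raise ValueError("unknown warmup mode")
        self.n = n
        self.warmup = warmup
        self.frames = deque(maxlen=n)  # the oldest frame is dropped automatically

    def push(self, frame):
        """Add a frame; return the frames to feed the networks, or None if no output."""
        self.frames.append(frame)
        if len(self.frames) == self.n:
            return list(self.frames)
        if self.warmup == "current":
            return [frame]
        if self.warmup == "available":
            return list(self.frames)
        return None  # caller may display a status message such as "buffering"


for mode in ("none", "current", "available"):
    buffer = FrameBuffer(3, warmup=mode)
    print(mode, [buffer.push(i) for i in range(4)])
```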
In some embodiments, real-time processing system 400 may determine the number of instances of trained neural networks to run in parallel based on operator input and/or relevant processing parameters (e.g., the frame rate of the video generated by image device 410 and/or a frame buffer size). Real-time processing system 400 may also process only certain frames or areas of interest identified by object detector 420 to include objects. Further, real-time processing system 400 may selectively process frames and regions within the frames based on available system resources and/or performance requirements. Alternatively, or additionally, in other embodiments, real-time processing system 400 may adjust the size of the input object identified by object detector 420 based on a neural network used to determine a characterization. Real-time processing system 400 may adjust identified object size by adjusting the resolution of an image frame or buffer size of image frames with an identified object.
In some embodiments, real-time processing system 400 may control processing based on the number of frames and/or detected objects in the frames. For example, real-time processing system 400 may adjust the processing of frames to keep up with the frame rate of image device 410. In some embodiments, real-time processing system 400 may adjust the buffer size or length in view of the frame rate of image device 410. In some embodiments, real-time processing system 400 may keep up with the frame rate by processing multiple frames in parallel. Real-time processing system 400 may determine the number of neural network instances to run in parallel based on the frame rate of image device 410. Real-time processing system 400 may also determine the number of instances of neural networks 440-460 to run in parallel based on other real-time processing requirements or factors such as processing time delay(s) or restriction(s) due to available system resources (e.g., available hardware and software resources) and accuracy requirements in detecting objects and features of interest of detected objects. Real-time processing system 400 may also achieve real-time processing requirements by adjusting the sampling rate to select a subset of frames from image device 410. Additionally, or alternatively, real-time processing system 400 may sample frames and/or persistent objects detected in the received frames to meet real-time processing requirements.
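One possible way to relate the frame rate, the per-frame processing time, and the number of parallel instances is sketched below. The function names, the assumption that each instance handles one frame at a time with a fixed latency, and the `max_instances` resource cap are all illustrative assumptions rather than the disclosed method.

```python
import math


def instances_needed(frame_rate_hz: float, latency_s: float) -> int:
    """Minimum number of parallel instances whose combined throughput
    matches the incoming frame rate.

    Assumes each instance processes one frame at a time and needs
    latency_s seconds per frame.
    """
    if frame_rate_hz <= 0 or latency_s <= 0:
        raise ValueError("frame rate and latency must be positive")
    return max(1, math.ceil(frame_rate_hz * latency_s))


def sampling_stride(frame_rate_hz: float, latency_s: float, max_instances: int) -> int:
    """Stride k such that processing every k-th frame stays within the
    available number of instances (k == 1 means every frame is processed)."""
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    needed = instances_needed(frame_rate_hz, latency_s)
    return max(1, math.ceil(needed / max_instances))
```

Under these assumptions, a stride greater than one corresponds to sampling a subset of frames when system resources cannot support the required number of instances.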
In some embodiments, real-time processing system 400 may skip execution of one or more trained neural networks depending on operator input or settings (such as a command not to include confidence values and/or medical guidelines and/or command(s) selecting which of the object characterization features to include for processing). Frames may also be skipped for processing when no object is identified over one or more frames. Additionally, or alternatively, real-time processing system 400 may skip or pause execution of one or more trained neural networks depending on the operating mode of image device 410 (e.g., cleaning versus navigating) and/or the location of image device 410 in or relative to a patient's body or organ during the medical procedure. For example, when real-time processing system 400 determines that an endoscope device is out of a patient's colon, it may deactivate one or more of the neural networks of the system. Additionally, or alternatively, real-time processing system 400 may skip execution of one or more of the networks 440-460 based on actions on objects or the status of object detector 420. For example, real-time processing system 400 may deactivate the neural networks 440-460 during a resection/surgery of an object identified by object detector 420 or while an operator is performing other tasks (e.g., lesion insufflation) on objects of interest. While one or more networks 440-460 are deactivated, real-time processing system 400 may continue to receive information regarding detected objects and/or features of interest from object detector 420 and/or other system components (e.g., a frame quality network, an object tracker, an aggregator, etc.; see
Although not shown in
As disclosed herein, the trained neural networks of characterization network 430 may be arranged to operate in parallel to more efficiently determine the characterizations (e.g., classification, location, and size) of identified objects during a medical procedure. Referring now
Encoder network 485 and latent representation 486 may be implemented with one or more neural networks that are trained using a combination of unsupervised reconstruction loss and a supervised loss based on classification, location, and size tasks. Additionally, or alternatively, encoder network 485 can be trained with a loss from the contrastive loss family such as triplet loss or quadruplet loss which enforces a structured organization of the latent space. In this way, the latent space can assign a similar representation to image frames belonging to the same object and a more robust distance metric can be defined between latent representations. As discussed above, encoder network 485 may embed the inherent structure of the detected object(s) by projecting into a latent space, for example, latent representation 486. Encoder network 485 together with latent representation 486 may process image frame(s) or surrounding area(s) with each detected object to project into latent space by encoding layers of the network resulting in a latent vector in a lower dimension than the detected object(s) in the processed frames or surrounding areas. As discussed above, this provides several benefits including reducing storage requirements and improving processing efficiencies with respect to the object data.
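As an illustration of how a contrastive-family loss structures the latent space, the following sketch computes the commonly used form of triplet loss, max(0, d(a, p) − d(a, n) + margin), for a single triplet. The Euclidean distance, the default margin of 1.0, and the function names are assumptions made for this example; the disclosure does not specify them.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two latent vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero when the positive (same object) is closer to the anchor than the
    negative (different object) by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Minimizing this quantity over many triplets pulls latent vectors of frames showing the same object together and pushes those of different objects apart, which is what makes a distance between latent representations meaningful.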
In some embodiments, a tracking module (not shown in
For each object, the embedded representation in latent space (i.e., the output of latent representation 486) may be fed in parallel to the three characterization networks (i.e., classification network 440, location network 450, and size network 460) to determine the characteristics or features for the object. Advantageously, with this implementation, the trained neural networks 440-460 will be small (i.e., just a few fully connected layers) since the encoding part is shared and performed within the encoder network 485. This reduces the overall computational cost and improves the efficiency of the characterization network 430. Consequently, real-time processing system 480 benefits from a reduction in time needed to process and characterize objects of interest and provide output to display device 470.
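The shared-encoder arrangement can be sketched as follows. This is a minimal illustration that assumes the encoder and the characterization heads are available as plain Python callables; the function name `characterize`, the thread pool, and the dictionary-of-heads interface are hypothetical choices for the example, not the disclosed implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def characterize(
    patch: Vector,
    encoder: Callable[[Vector], Vector],
    heads: Dict[str, Callable[[Vector], object]],
) -> Dict[str, object]:
    """Encode the object once, then run every head on the same latent vector."""
    latent = encoder(patch)  # shared and comparatively expensive step, done once
    with ThreadPoolExecutor(max_workers=max(1, len(heads))) as pool:
        # Each small head receives the same latent vector and runs concurrently.
        futures = {name: pool.submit(head, latent) for name, head in heads.items()}
        return {name: future.result() for name, future in futures.items()}
```

Because the encoding is computed a single time, adding a further head (for example, one producing a confidence value) adds only the cost of that small head.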
Characterization network 430 may aid in determining characteristic features of objects of interest in frames accessed directly from image device 410 or storage (e.g., storage 220 or buffer device with memory) containing previously generated image frames. Characterization network 430 allows inclusion of various networks for simultaneously determining various characteristic features of objects and optimizing the processing of image frames when determining characteristic features.
Characterization network 430 may be composed of multiple networks and configured to select and unselect various configurations of networks for identifying objects of interest in image frames and determining their characteristics in an optimized manner. Characterization network 430 may also be configured to have multiple copies of the same network to parallel process image frames or patches of an image frame to determine characteristics. In some embodiments, characterization network 430 optimizes frame processing by selecting the required networks and configuring the order of networks to pre-process image frames by executing common operations across networks. For example, characterization network 430 may utilize encoder network 485 and latent representation 486 to pre-process images and have smaller networks to determine characteristics in an efficient manner while using fewer computational resources.
Furthermore, as discussed above, a medical guideline may also be displayed. As shown in
In some embodiments, additional modules or processes may be provided and performed before, after, or concurrently with characterization network 430. For example, in some embodiments, the at least one processor may be configured to apply one or more neural networks that implement a trained quality network configured to determine a frame quality associated with at least one of the plurality of frames. A “frame quality,” as used herein, may refer to a degree of visual clarity of one or more frames for purposes of performing the operations described herein. A frame quality may be based on any visual characteristic, such as blurriness, sharpness, brightness, lighting, exposure, contrast, movement, visibility, or any other feature of one or more frames. The trained frame quality network may be trained to generate a numeric value and/or a quality classification associated with a frame quality. For example, the trained frame quality network may be configured to output a number (e.g., 0.7) associated with the frame quality, or it may be configured to assign a quality class to a frame (e.g., “sufficient quality” or “not sufficient quality”).
The trained frame quality network may comprise one or more suitable machine learning networks or algorithms for determining a quality value associated with one or more frames in the real-time video, including one or more neural networks (e.g., a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, or any other suitable model as described above trained to determine a frame quality. The frame quality network may be trained using a plurality of training frames or portions thereof labeled based on one or more quality values or classifications. For example, a first set of training frames (or portions of frames) may be labeled as “sufficient quality,” and a second set of training frames (or portions of frames) may be labeled as “not sufficient quality.” Weights or other parameters of the frame quality network may be adjusted based on its output with respect to a third, non-labeled set of training frames (or portions of frames) until a convergence or other metric is achieved, and the process may be repeated with additional training frames (or portions thereof) or with live data as described herein. The trained frame quality network may be stored in the computing device and/or system, or it may be fetched from a network or database prior to determining the frame quality. In some embodiments, the trained frame quality network may be re-trained based on one or more of its outputs, such as accurate or inaccurate frame quality detections. The feedback for re-training may be generated automatically by the system or the computing device, or it may be manually inputted by the operator or another user (e.g., through a mouse or keyboard or other input device). Weights or other parameters of the trained frame quality network may be adjusted based on the feedback. 
In some embodiments, conventional non-machine learning frame quality detection networks or algorithms may be used, either alone or in combination with the trained frame quality network.
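As one example of a conventional, non-machine-learning quality measure, the variance of the Laplacian is widely used as a sharpness indicator: blurry frames have weak edges and therefore a low variance. The sketch below is an assumption chosen for illustration (the disclosure does not name a specific algorithm) and operates on a grayscale image given as a list of rows.

```python
from typing import List


def laplacian_variance(image: List[List[float]]) -> float:
    """Variance of the 4-neighbor Laplacian over the interior pixels.

    Higher values suggest sharper content; a flat image yields 0.0.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    responses = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

A threshold on this value, chosen empirically for the image device in use, could serve as a simple "sufficient quality" test alongside or instead of a trained network.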
In some embodiments, the trained frame quality network may be configured to generate a confidence value associated with the determined frame quality. A confidence value for a determined frame quality may refer to an indication of the level of certainty associated with the determined frame quality. For example, a confidence value of 0.9 or 90% may indicate that there is a ninety percent certainty that the determined frame quality is correct, while a confidence value of 0.4 or 40% may indicate that there is a forty percent certainty that the determined frame quality is correct, and so forth. Other values, metrics, or representations may be used to represent a confidence value, however. For example, a confidence value may be categorized as “high confidence” when the confidence value is above a predetermined threshold (e.g., above 66% confidence), “low confidence” when the confidence value is below a predetermined threshold (e.g., below 33% confidence), or “undetermined” when between two predetermined thresholds (e.g., 33% to 66% confidence). In some embodiments, the at least one processor may be configured to present, in real-time on the display device, information for the determined frame quality in any suitable format (e.g., the frame quality value and/or classification, a thumbs up or thumbs down, a check mark or cross sign, or a color).
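The threshold-based categorization in the example above can be written as a small function. The function name and the treatment of values exactly equal to a threshold (classified here as "undetermined") are assumptions; the 33% and 66% defaults are the example values from the text.

```python
def confidence_category(confidence: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a confidence value in [0, 1] to a coarse category."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if not low < high:
        raise ValueError("low threshold must be below high threshold")
    if confidence > high:
        return "high confidence"
    if confidence < low:
        return "low confidence"
    return "undetermined"  # between the two thresholds, inclusive
```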
The at least one processor of the computing device or system may be configured to aggregate data associated with a determined characterization (e.g., classification, location, and/or size) when at least one of the determined frame quality or the confidence value is above a predetermined threshold, consistent with the embodiments of the present disclosure. Aggregation in this context may refer to any operation for combining, collecting, or receiving multiple data. For example, in embodiments where the classification, location, and size of an object of interest in a frame are determined, the at least one processor may be configured to collect the classification, location, and size determination from the characterization network only when the determined frame quality from the frame is above the predetermined threshold (e.g., having a frame quality greater than 0.4 or being classified as “sufficient quality”). In some embodiments, the at least one processor may be configured to present, on the display device, at least a portion of the aggregated data. The aggregated data may be displayed in the same or similar manner as described above (e.g., using an LED display, virtual reality display, or augmented reality display).
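A minimal sketch of quality-gated aggregation follows. The class and field names are hypothetical, the 0.4 quality threshold is the example value from the text, and the 0.66 confidence threshold is an assumed value for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Characterization:
    classification: str
    location: str
    size_mm: float


@dataclass
class GatedAggregator:
    quality_threshold: float = 0.4       # example value from the text
    confidence_threshold: float = 0.66   # assumed value
    records: List[Characterization] = field(default_factory=list)

    def offer(self, item: Characterization, frame_quality: float,
              quality_confidence: float) -> bool:
        """Collect the item when at least one of the frame quality or its
        confidence value exceeds its threshold; report whether it was kept."""
        if (frame_quality > self.quality_threshold
                or quality_confidence > self.confidence_threshold):
            self.records.append(item)
            return True
        return False
```

The contents of `records` would then be the aggregated data from which a portion is selected for display.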
Other information may be aggregated, such as other determined features, and other metrics may be used to determine whether to aggregate the data depending on the specific application or context. In some embodiments, for example, the at least one processor may be configured to aggregate, when the object of interest persists over more than one of the plurality of frames, information associated with the determined features (e.g., location and size) of the object of interest. A “persistence,” or variations thereof as used herein, may refer to an object of interest's continued presence in a location of one or more frames. A persistence may be determined using any process for comparing the presence of an object of interest in one or more frames. For example, an Intersection over Union (IoU) value for the location of an object of interest in two or more image frames may be calculated, and the IoU value may be compared with a threshold to determine whether the object of interest persists over more than one of the frames. An IoU value may be estimated using the following formula:
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
In the above IoU formula, Area of Overlap is the area where the object of interest is present in two or more frames, and Area of Union is the total area where the object of interest is present in the two or more frames. As a non-limiting example, an IoU value above 0.5 (e.g., approximately 0.6 or 0.7 or higher, such as 0.8 or 0.9) between two consecutive frames may be used to determine that the object of interest persists in the two consecutive frames. In contrast, an IoU value below 0.5 (e.g., approximately 0.4 or lower) between the same two frames may be used to determine that the object of interest does not persist. Other methods of determining a persistence may be used, however, depending on the application or context. When the object is determined to persist over more than one of the plurality of frames, information associated with the determined features (e.g., location and size) of the object of interest may be aggregated, in the same or similar manner as discussed above. The at least one processor may be configured to present, on a display device, the aggregated data or a portion thereof. In this manner, information may be displayed only for objects of interest that are sufficiently present in two or more frames, so as to avoid displaying useless or distracting information during capture (e.g., during a medical procedure).
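The IoU comparison can be sketched for the common case in which the location of the object in each frame is an axis-aligned bounding box given as (x1, y1, x2, y2). The box representation and the function names are assumptions for illustration; the 0.5 default is the non-limiting example threshold from the text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


def persists(a: Box, b: Box, threshold: float = 0.5) -> bool:
    """True when the IoU between two frames exceeds the threshold."""
    return iou(a, b) > threshold
```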
In some embodiments, aggregated information may be displayed based on one or more criteria. For example, the at least one processor may be configured to present, on a display device, when the determined location is in a first body region and the determined size is within a first range, the aggregated information (e.g., location and size) for the object of interest. Further, the at least one processor may be configured to present, on the display device, when the determined location is in a second body region and the determined size is within a second range, information indicating a status of the characterization of the object of interest (i.e., non-aggregated information). As a non-limiting example, in embodiments where the object of interest is a polyp and the aggregated information is location and size, classification information (i.e., non-aggregated information) may be displayed when the polyp is determined to be diminutive and located in a body location other than the sigma rectum. Conversely, the location and size (i.e., the aggregated information) may be displayed when the polyp is determined to be non-diminutive or located in the sigma rectum. In this manner, only relevant information may be displayed to an operator based on predetermined aggregation rules and/or display criteria so as to provide only critical information during capture (e.g., during a medical procedure).
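The polyp example above can be expressed as a simple display rule. In this sketch, the 5 mm cut-off for a diminutive polyp and the mapping of "sigma rectum" to the rectum and sigmoid colon are assumptions introduced for illustration, as are the function and parameter names.

```python
from typing import Dict, Tuple


def select_display(
    classification: str,
    location: str,
    size_mm: float,
    diminutive_max_mm: float = 5.0,  # assumed cut-off
    special_region: Tuple[str, ...] = ("rectum", "sigmoid colon"),  # assumed mapping
) -> Dict[str, object]:
    """Choose which information to present for a characterized polyp."""
    diminutive = size_mm <= diminutive_max_mm
    in_region = location.lower() in special_region
    if diminutive and not in_region:
        # Diminutive polyp outside the special region: show classification only.
        return {"classification": classification}
    # Non-diminutive, or located in the special region: show location and size.
    return {"location": location, "size_mm": size_mm}
```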
In
In some embodiments, frame quality network 620 may classify frames from an image device and/or patches of frames containing an object of interest detected by object detector 610 based on a set of classes learned in training. Frame quality network 620 may output a confidence value of its output by learning implicitly during training using, for example, one-hot-encoding formulation. Alternatively, a dedicated output node may be added to frame quality network 620 to provide a confidence estimation and an abstention term included in the loss function to train the network. Frame quality network 620 may use the abstention term to predict low confidence values when the estimation error is high due to, for example, low quality or cluttered image frames. In some embodiments, frame quality network 620 may learn to generate confidence values of its output explicitly if a confidence value of a ground truth value is available for the training data used to train frame quality network 620. In some embodiments, frame quality network 620 may use neural network calibration methods such as mixup and label smoothing during training to control the range and distribution of confidence values. One or more neural networks may be used to implement frame quality network 620 and may be trained to predict both the output and a confidence value or label uncertainty, if available.
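For reference, label smoothing and mixup in their commonly used forms can be sketched as below. Label smoothing replaces a one-hot target with (1 − ε) on the true class plus ε/K spread over all K classes; mixup forms a convex combination of two training examples and their targets. The function names and default ε are assumptions, and the mixing coefficient is passed in explicitly rather than sampled.

```python
from typing import List, Sequence, Tuple


def smooth_labels(class_index: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Smoothed target distribution for one training example."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be between 0 and 1")
    off = epsilon / num_classes
    return [(1.0 - epsilon) + off if k == class_index else off
            for k in range(num_classes)]


def mixup(
    x_a: Sequence[float], y_a: Sequence[float],
    x_b: Sequence[float], y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Blend two (input, target) pairs with weight lam on the first pair."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")

    def blend(u: Sequence[float], v: Sequence[float]) -> List[float]:
        return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]

    return blend(x_a, x_b), blend(y_a, y_b)
```

Both techniques soften the training targets, which tends to discourage the network from producing extreme confidence values.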
Characterization network 630 may identify one or more features and/or confidence values associated with the one or more features of the detected object of interest, as discussed above. In embodiments where characterization network 630 receives a frame quality and/or a confidence value associated with the frame quality from frame quality network 620, characterization network 630 may be configured to detect (or provide an output for) features of the object of interest only in frames where the frame quality and/or the confidence value associated with the frame quality are above a predetermined threshold (e.g., greater than 0.4 or classified as “sufficient quality”). Tracker 640 may be configured to determine a persistence of a detected object of interest, as described above. In some embodiments, tracker 640 may receive feature detections from characterization network 630 and may be configured to only provide them as output when it determines that the detected object of interest persists over more than one of the frames or a predetermined number of frames. In some embodiments, tracker 640 may be configured to receive a frame quality and/or a confidence value associated with the frame quality, which it may utilize with its persistence determination to determine whether to provide an output (e.g., it may provide an output only when the detected object of interest persists in more than one of the frames or a predetermined number of frames and the frame quality and/or confidence value are above a predetermined threshold).
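The tracker's combined check (persistence over a predetermined number of frames together with sufficient frame quality) might be sketched as follows. The class name, the use of a consecutive-detection counter as the persistence measure, and the default values are assumptions for illustration.

```python
class PersistenceGate:
    """Release output only for objects seen in enough consecutive frames
    of sufficient quality (hypothetical helper)."""

    def __init__(self, min_frames: int = 2, quality_threshold: float = 0.4) -> None:
        if min_frames < 1:
            raise ValueError("min_frames must be at least 1")
        self.min_frames = min_frames
        self.quality_threshold = quality_threshold
        self._streak = 0  # consecutive frames in which the object was matched

    def update(self, detected: bool, frame_quality: float) -> bool:
        """Record one frame; return True if feature output should be provided."""
        self._streak = self._streak + 1 if detected else 0
        return (self._streak >= self.min_frames
                and frame_quality > self.quality_threshold)
```

Here `detected` would come from a per-frame matching step such as an IoU comparison against the previous frame.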
Aggregator 650 may receive outputs from any of the previously mentioned components, including object detector 610, frame quality network 620, characterization network 630, and/or tracker 640, and it may aggregate or combine any of the received information based on one or more criteria for presentation on display 660, as discussed above. For example, aggregator 650 may receive information associated with features detected by characterization network 630 to determine what features to aggregate depending on predefined criteria. For example, aggregator 650 may output, to display device 660, when the determined location is in a first body region and the determined size is within a first range, the aggregated location and size information for the object of interest. Additionally, or alternatively, aggregator 650 may output, to display device 660, when the determined location is in a second body region and the determined size is within a second range, information indicating a status of the characterization of the object of interest instead. By way of example, the status of the characterization of the object of interest provided by the aggregator 650 may include the status of the aggregation to inform the operator or user as to whether there is aggregated information or only non-aggregated information. Aggregator 650 may use other rules and criteria to output information for presentation on display device 660, as discussed above.
As shown in
At steps 730, 740, and 750, a classification network, a location network, and a size network may be applied to determine a classification, location, and size of the detected object of interest, respectively, as discussed above. At step 760, if the frame has a sufficient frame quality, at least a portion or set of the classification, location, and size information is aggregated. At step 770, one or more criteria are applied to determine what portion of the aggregated data to present for display (e.g., whether the determined location is in a first body region and the determined size is within a first range, or whether the determined location is in a second body region and the determined size is within a second range), as discussed above. If a first set of criteria is met, then at step 780 a first portion of the aggregated information may be displayed. If a second set of criteria is met, then at step 790 a second portion of the aggregated information may be displayed. Although not shown in
The diagrams and components in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer hardware or software products according to various example embodiments of the present disclosure. For example, each block in a flowchart or diagram may represent a module, segment, or portion of code, which includes one or more executable instructions for implementing the specified logical functions. It should also be understood that in some alternative implementations, functions indicated in a block may occur out of the order noted in the figures. By way of example, two blocks shown in succession may be executed or implemented substantially concurrently, or two blocks may sometimes be executed in reverse order, depending upon the functionality involved. Furthermore, some blocks may also be omitted. It should also be understood that each block of the diagrams, and combination of the blocks, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. Computer program products (e.g., software or program instructions) may also be implemented based on the described embodiments and illustrated examples.
It should be appreciated that the above-described systems and methods may be varied in many ways, including omitting or adding steps, changing the order of steps and the type of functions and/or components used. It should also be appreciated that different features may be combined in different ways. In particular, not all the features shown above in a particular embodiment or implementation are necessary in every embodiment or implementation. Further combinations of the above features and implementations are also considered to be within the scope of the herein disclosed embodiments or implementations.
While certain embodiments and features of implementations have been described and illustrated herein, modifications, substitutions, changes and equivalents will be apparent to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes that fall within the scope of the disclosed embodiments and features of the illustrated implementations. It should also be understood that the herein described embodiments have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the systems and/or methods described herein may be implemented in any combination, except mutually exclusive combinations. By way of example, the implementations described herein can include various combinations and/or sub-combinations of the functions, components and/or features of the different embodiments described.
Moreover, while illustrative embodiments have been described herein, the scope of the present disclosure includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations or alterations based on the embodiments disclosed herein. Further, elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described herein or during the prosecution of the present application. Instead, these examples are to be construed as non-exclusive. Further, the steps of the disclosed methods can be modified in any manner, including by reordering steps or inserting or deleting steps.
By way of further example, systems and methods consistent with the present disclosure include the following implementations and aspects.
A computer-implemented system for processing real-time video, the system comprising at least one processor configured to: receive a real-time video captured from a medical image device during a medical procedure, the real-time video comprising a plurality of frames; detect an object of interest in the plurality of frames; encode the object of interest in the plurality of frames to generate an embedded representation using an encoder network processing an area surrounding the object of interest; generate a latent representation of the encoded object of interest; apply one or more neural networks that implement a trained characterization network to the latent representation to determine one or more characteristics of the object of interest; modify the real-time video with augmenting information of the detected object of interest and the one or more characteristics of the object of interest; and present the modified video on a display device during the medical procedure.
In the above-described system, the medical procedure may include at least one of an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. Further, the object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion.
In the above-described system, the at least one processor may be further configured to identify, based on the one or more characteristics of the object of interest, a medical guideline. The at least one processor may also be configured to present information related to the identified medical guideline as part of the modified video on the display device. By way of example, the information related to the identified medical guideline may include an instruction to leave or resect the object of interest. As a further example, the information related to the identified medical guideline includes a type of resection.
In the above-described system, the at least one processor may be further configured to generate a confidence value associated with the identified medical guideline.
In the above-described system, the determined one or more characteristics of the object of interest may include a classification of the object of interest based on at least one of a histological classification, a morphological classification, a structural classification, or a malignancy classification. As a further example, the determined one or more characteristics of the object of interest may include a location associated with the object of interest. In some embodiments, the determined location is a location in a human body or relative to a human organ. Examples of a determined location include a rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum. As a still further example, the determined one or more characteristics of the object of interest may be a size of the object of interest. The determined size of the object of interest may be represented as a numeric value or a size classification.
In the above-described system, the at least one processor may be further configured to apply one or more neural networks that implement a trained quality network. The trained quality network may be configured to: determine a frame quality associated with one or more of the plurality of frames; and generate a confidence value associated with the determined frame quality.
In the above-described system, the at least one processor may be further configured to: aggregate data associated with the determined one or more characteristics when at least one of the determined frame quality or the confidence value is above a predetermined threshold; and present, on the display device, at least a portion of the aggregated data.
In the above-described system, the at least one processor may be further configured to: detect a plurality of objects of interest in the plurality of frames; determine a plurality of classifications and sizes associated with the plurality of detected objects of interest; and present, on the display device, information associated with one or more determined classifications and sizes.
In the above-described system, the at least one processor may be further configured to track the object of interest in the plurality of frames to determine temporal information related to the object of interest.
A computer-implemented method for processing real-time video, the method comprising the following steps: receiving a real-time video captured from a medical image device during a medical procedure, the real-time video comprising a plurality of frames; detecting an object of interest in the plurality of frames; encoding the object of interest in the plurality of frames to generate an embedded representation using an encoder network processing an area surrounding the object of interest; generating a latent representation of the encoded object of interest; applying one or more neural networks that implement a trained characterization network to the latent representation to determine one or more characteristics of the object of interest; modifying the real-time video with augmenting information of the detected object of interest and the one or more characteristics of the object of interest; and presenting the modified video in real-time on a display device during the medical procedure.
A computer-implemented system for processing real-time video, the system comprising at least one processor configured to: receive a real-time video captured from a medical image device during a medical procedure, the real-time video comprising a plurality of frames; detect an object of interest in the plurality of frames; encode the object of interest in the plurality of frames to generate an embedded representation using an encoder network processing an area surrounding the object of interest; generate a latent representation of the encoded object of interest; track, based on the latent representation, the object of interest in the plurality of frames to determine temporal information of the object of interest; and apply one or more neural networks that implement a trained characterization network to determine one or more characteristics of the object of interest, wherein the one or more characteristics are determined based on at least one of the latent representation and temporal information.
In the above-described system, the at least one processor may be further configured to: modify the real-time video with augmenting information of the detected object of interest and the one or more characteristics of the object of interest; and present the modified video on a display device during the medical procedure. The medical procedure may include at least one of an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. Further, the object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion.
In the above-described system, the at least one processor may be further configured to identify, based on the one or more characteristics of the object of interest, a medical guideline. The at least one processor may also be configured to present information related to the identified medical guideline as part of the modified video on the display device. By way of example, the information related to the identified medical guideline may include an instruction to leave or resect the object of interest. As a further example, the information related to the identified medical guideline may include a type of resection.
In the above-described system, the at least one processor may be further configured to generate a confidence value associated with the identified medical guideline.
In the above-described system, the determined one or more characteristics of the object of interest may include a classification of the object of interest based on at least one of a histological classification, a morphological classification, a structural classification, or a malignancy classification. As a further example, the determined one or more characteristics of the object of interest may include a location associated with the object of interest. In some embodiments, the determined location is a location in a human body or relative to a human organ. Examples of a determined location include a rectum, sigmoid colon, descending colon, transverse colon, ascending colon, or cecum. As a still further example, the determined one or more characteristics of the object of interest may include a size of the object of interest. The determined size of the object of interest may be represented as a numeric value or a size classification.
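By way of a non-limiting illustration, the characteristics described above may be held in a simple record, with the size available both as a numeric value and as a size classification. The location names are those listed above. The `Characteristics` record, the millimetre unit, the bin edges, and the labels "small", "medium", and "large" are hypothetical placeholders chosen for the sketch and are not clinical values from the disclosure.

```python
from bisect import bisect_right
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Location(Enum):
    """Example locations named in the description."""
    RECTUM = "rectum"
    SIGMOID_COLON = "sigmoid colon"
    DESCENDING_COLON = "descending colon"
    TRANSVERSE_COLON = "transverse colon"
    ASCENDING_COLON = "ascending colon"
    CECUM = "cecum"


# Hypothetical bin edges (millimetres) and labels; placeholders, not clinical values.
SIZE_EDGES_MM = (5.0, 10.0)
SIZE_LABELS = ("small", "medium", "large")


def size_class(size_mm: float) -> str:
    """Map a numeric size to a size classification using the placeholder bins."""
    return SIZE_LABELS[bisect_right(SIZE_EDGES_MM, size_mm)]


@dataclass
class Characteristics:
    """Determined characteristics of one object of interest; any field may be absent."""
    classification: Optional[str] = None
    location: Optional[Location] = None
    size_mm: Optional[float] = None

    @property
    def size_label(self) -> Optional[str]:
        """Size classification derived from the numeric size, if a size is available."""
        return None if self.size_mm is None else size_class(self.size_mm)
```

Keeping the numeric value as the stored field and deriving the classification from it means the two representations cannot disagree. A system that obtains the size classification directly from a network would instead store the label itself.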
In the above-described system, the at least one processor may be further configured to apply one or more neural networks that implement a trained quality network. The trained quality network may be configured to: determine a frame quality associated with one or more of the plurality of frames; and generate a confidence value associated with the determined frame quality.
In the above-described system, the at least one processor may be further configured to: aggregate data associated with the determined one or more characteristics when at least one of the determined frame quality or the confidence value is above a predetermined threshold; and present, on the display device, at least a portion of the aggregated data.
In the above-described system, the at least one processor may be further configured to: detect a plurality of objects of interest in the plurality of frames; determine a plurality of classifications and sizes associated with the plurality of detected objects of interest; and present, on the display device, information associated with one or more determined classifications and sizes.
In the above-described system, the at least one processor may be further configured to track the object of interest in the plurality of frames to determine temporal information related to the object of interest. To track the object of interest, the at least one processor may be configured to track the object of interest in the plurality of frames based on similarity of latent representations of the object of interest in the plurality of frames. Additionally, or alternatively, to track the object of interest, the at least one processor may be configured to determine the number of frames over which there is a persistence of the object of interest. The at least one processor may also be configured to track the object of interest based on frame quality information for each frame.
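By way of a non-limiting illustration, tracking based on similarity of latent representations, together with a persistence count, may be sketched as follows. The sketch assumes that latent representations are fixed-length numeric vectors and uses cosine similarity with a fixed threshold. The similarity measure, the threshold of 0.8, and the names `Track` and `LatentTracker` are assumptions of the sketch, since the disclosure does not specify them.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Track:
    track_id: int
    latent: List[float]   # most recent latent representation matched to this track
    persistence: int = 1  # number of frames in which the object has been matched


class LatentTracker:
    """Associates per-frame latent vectors with existing tracks by similarity."""

    def __init__(self, min_similarity: float = 0.8) -> None:
        self.min_similarity = min_similarity  # assumed threshold
        self.tracks: List[Track] = []

    def update(self, latent: Sequence[float]) -> Track:
        """Match the latent vector to the most similar track, or start a new track."""
        best = max(self.tracks,
                   key=lambda t: cosine_similarity(t.latent, latent),
                   default=None)
        if best is not None and cosine_similarity(best.latent, latent) >= self.min_similarity:
            best.latent = list(latent)
            best.persistence += 1
            return best
        track = Track(track_id=len(self.tracks), latent=list(latent))
        self.tracks.append(track)
        return track
```

Here `persistence` counts every frame matched to a track, whether or not the frames are consecutive. A variant that takes frame quality into account could, for example, skip the update for frames whose quality falls below a threshold.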
A computer-implemented method for processing real-time video, the method comprising: receiving a real-time video captured from a medical image device during a medical procedure, the real-time video comprising a plurality of frames; detecting an object of interest in the plurality of frames; encoding the object of interest in the plurality of frames to generate an embedded representation using an encoder network processing an area surrounding the object of interest; generating a latent representation of the encoded object of interest; tracking, based on the latent representation, the object of interest in the plurality of frames to determine temporal information of the object of interest; and applying one or more neural networks that implement a trained characterization network to determine one or more characteristics of the object of interest, wherein the one or more characteristics are determined based on at least one of the latent representation and temporal information.
In the above-described method, the method may further include: modifying the real-time video with augmenting information of the detected object of interest and the one or more characteristics of the object of interest; and presenting the modified video on a display device during the medical procedure. The medical procedure may include at least one of an endoscopy, a gastroscopy, a colonoscopy, or an enteroscopy. Further, the object of interest may include at least one of a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, an absence of human tissue from a location where the human tissue is expected, or a lesion.
It is intended, therefore, that the specification and examples herein be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
Number | Date | Country | Kind |
---|---|---|---|
21185179.5 | Jul 2021 | EP | regional |
The present application claims priority to U.S. Provisional Application No. 63/220,585, filed on Jul. 12, 2021, and European Priority Application No. EP21185179.5, filed on Jul. 12, 2021, the entirety of each of which is hereby incorporated by reference herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/069364 | 7/12/2022 | WO | |
Number | Date | Country | |
---|---|---|---|
63220585 | Jul 2021 | US | |