The present disclosure relates generally to the field of video processing and image analysis. More specifically, and without limitation, this disclosure relates to systems, methods, and computer-readable media for processing captured video content from an imaging device and performing intelligent image analysis, such as determining the presence of one or more features of interest or actions taken during a medical procedure. The systems and methods disclosed herein may be used in various applications, including for medical image analysis and diagnosis.
In video processing and image analysis systems, it is often desirable to detect objects or features of interest. A feature of interest may be a person, place, or thing. In some applications, such as systems and methods for medical image analysis, the location and classification of a detected feature of interest (e.g., an abnormality such as a formation on or of human tissue) is important for diagnosis of a patient. However, extant computer-implemented systems and methods suffer from a number of drawbacks, including the inability to accurately detect features of interest and/or recognize characteristics related to features of interest. In addition, extant systems and methods are inefficient and do not provide ways to analyze images intelligently, including with regard to the image sequence or presence of events.
Modern medical procedures require precise and accurate examination of a patient's body and organs. Endoscopy is a medical procedure aimed at providing a physician with video images of the internal parts of a patient's body and organs for diagnosis. In the gastrointestinal tract of the human body, the procedure can be performed by introducing a probe with a video camera through the mouth or anus of the patient. During an endoscopic procedure, a physician manually navigates the probe through the gastrointestinal tract while watching the video in real time on a display device. The video may also be captured, stored, and examined after the endoscopic procedure. As an alternative, capsule endoscopy is a procedure in which a capsule containing a small camera is swallowed to examine the gastrointestinal tract of a patient. The sequence of images taken by the capsule during its transit is transmitted wirelessly to a receiving device and stored for examination by the physician after completion of the procedure. The frame rate of a capsule device can vary (e.g., 2 to 6 frames per second), and a large volume of images may be taken during an examination procedure.
From a computer vision perspective, the captured content from either a real-time video endoscopy or capsule procedure is a temporally ordered succession of images containing information about a patient, e.g., the internal mucosa of the gastrointestinal tract. Accurate and precise analysis of the captured image data is essential to properly examine the patient and identify lesions, polyps, or other features of interest. Also, there is usually a large number of images collected for each patient. One of the most important medical tasks that needs to be performed by the physician is the examination of this large set of images to make a proper diagnosis including with respect to the presence or absence of features of interest, such as pathological regions in the imaged mucosa. However, going through these images manually is time consuming and inefficient. As a result, the review process can lead to a physician making errors and/or making a misdiagnosis.
In order to improve diagnosis, decrease the time needed for medical image examination, and reduce the possibility of errors, the inventors have determined that it is desirable to have a computer-implemented system and method that is able to intelligently process images and identify the presence of a pathology or other features of interest within all images from a video endoscopy or capsule procedure, or other medical procedure. By way of example, a feature of interest may also include an action being taken on or in the images, an anatomical location or other location of interest in the images, a clinical index level of the images, and so on. Trained neural networks, spatio-temporal image analysis, and other features and techniques are disclosed herein for this purpose. As will be appreciated from this disclosure, the present invention and embodiments may be applied to a wide variety of image capture and analysis applications and are not limited to the examples presented herein.
Embodiments of the present disclosure include systems, methods, and computer-readable media for processing images captured from an imaging device and performing an intelligent image analysis, such as determining the presence of one or more features of interest. Systems and methods consistent with the present disclosure can provide benefits over extant systems and techniques, including by addressing one or more of the above-referenced drawbacks and/or other shortcomings of extant systems and techniques. Consistent with some disclosed embodiments, systems, methods, and computer-readable media are provided for processing images from a video endoscopy or capsule procedure or other medical procedure, where the images are temporally ordered. Example embodiments include systems and methods that intelligently process captured images using spatio-temporal information to accurately assess the likelihood of the presence of an abnormality, a pathology, or other features of interest within the images. As a further example, a feature of interest can be a parameter or statistic related to an endoscopy or capsule procedure or other medical procedure. By way of example, a feature of interest of an endoscopy procedure may be a clean withdrawal time or time for traversal of a probe or a capsule through an organ. A feature of interest in an image may also be determined based on the presence or absence of characteristics related to that feature of interest. These and other embodiments, features, and implementations are described more fully herein. A feature of interest may be any feature in or related to one or more images, in particular in or related to a scene or field of view represented in one or more images, that is identifiable, or detectable, by analyzing the or each image. A feature of interest may for example be an object, or a location, or an action or a condition (e.g., a clinical index level).
In some embodiments, images captured by an imaging device, such as an endoscopy video camera or capsule camera, include images of a gastrointestinal tract or organ. The images may come from a medical imaging device used during, for example, a gastroscopy, a colonoscopy, or an enteroscopy. A feature of interest in the images may be an abnormality or other pathology, for example. The abnormality or pathology may comprise a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, or an absence of human tissue from a location where the human tissue is expected. The formation may comprise a lesion, a polypoid lesion, or a non-polypoid lesion. Other examples of features of interest include an anatomical or other location, an action, a clinical index (e.g., cleanliness), and so on. Consequently, as will be appreciated from this disclosure, the example embodiments may be utilized in a medical context in a manner that is not specific to any single disease but may rather be generally applied.
According to one general aspect of the present disclosure, a computer-implemented system is provided for processing images captured by an imaging device. The computer-implemented system may include at least one processor configured to detect at least one feature of interest in images captured by an imaging device. The at least one processor may be configured to: receive an ordered set of images from the captured images, the ordered set of images being temporally ordered; analyze one or more subsets of the ordered set of images individually using a local spatio-temporal processing module, the local spatio-temporal processing module being configured to determine the presence of characteristics related to at least one feature of interest in each image of each subset of images and to annotate the images of each subset with a feature vector based on the determined characteristics in each image of each subset of images; process a set of feature vectors of the ordered set of images using a global spatio-temporal processing module, the global spatio-temporal processing module being configured to refine the determined characteristics associated with each subset of images, wherein each feature vector of the set of feature vectors includes information about each determined characteristic of the at least one feature of interest; and calculate a numerical value for each image using a timeseries analysis module, the numerical value being representative of the presence of at least one feature of interest and calculated using the refined characteristics associated with each subset of images and spatio-temporal information. Further, the at least one processor may be configured to generate a report on the at least one feature of interest using the numerical value associated with each image of each subset of the ordered set of images. The report may be generated after the completion of the endoscopy or other medical procedure. The report may include information related to all features of interest identified in the processed images.
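By way of non-limiting illustration, the following Python sketch shows one possible way such a three-stage pipeline could be organized; the class names, the number of characteristics per feature vector, and the placeholder scoring logic are assumptions introduced here for explanation only and are not a definitive implementation of the disclosed system.

```python
# Hypothetical sketch of the three-stage pipeline described above.
# All class and function names are illustrative assumptions.
from typing import List, Sequence
import numpy as np


def chunk(images: Sequence, size: int) -> List[Sequence]:
    """Split the temporally ordered images into consecutive subsets."""
    return [images[i:i + size] for i in range(0, len(images), size)]


class LocalSpatioTemporalModule:
    def annotate(self, subset) -> np.ndarray:
        """Return one feature vector per image describing the characteristics
        detected in that image (placeholder logic for illustration)."""
        return np.random.rand(len(subset), 8)  # 8 example characteristics


class GlobalSpatioTemporalModule:
    def refine(self, feature_vectors: np.ndarray) -> np.ndarray:
        """Refine the per-image characteristics using the whole sequence,
        e.g., by temporal smoothing across all subsets."""
        kernel = np.ones(5) / 5.0
        return np.stack([np.convolve(c, kernel, mode="same")
                         for c in feature_vectors.T], axis=1)


class TimeseriesAnalysisModule:
    def score(self, refined: np.ndarray) -> np.ndarray:
        """Collapse the refined characteristics of each image into a single
        numerical value representative of the feature of interest."""
        return refined.mean(axis=1)


def process(images, subset_size: int = 16) -> np.ndarray:
    local = LocalSpatioTemporalModule()
    global_module = GlobalSpatioTemporalModule()
    timeseries = TimeseriesAnalysisModule()
    vectors = np.concatenate([local.annotate(s) for s in chunk(images, subset_size)])
    refined = global_module.refine(vectors)
    return timeseries.score(refined)  # one value per image, used to build the report


per_image_scores = process(list(range(100)))  # placeholder images
```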
The at least one processor of the computer-implemented system may be further configured to determine a likelihood of characteristics related to at least one feature of interest in each image of the subset of images. Additionally, the at least one processor may be configured to determine the likelihood of characteristics in each image of the subset of images by encoding each image of the subset of the images and aggregating the spatio-temporal information of the determined characteristics using a recurrent neural network or a temporal convolution network.
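As a hedged illustration of one such arrangement, the sketch below encodes each image of a subset with a small convolutional encoder and aggregates the resulting embeddings over time with a recurrent network (a GRU) to produce a per-image likelihood; the layer sizes and overall architecture are assumptions for illustration and not the disclosed networks.

```python
# Illustrative sketch (assumed architecture): a per-frame CNN encoder whose
# embeddings are aggregated over time with a recurrent network (GRU) to
# produce a likelihood of the characteristic for every image in the subset.
import torch
import torch.nn as nn


class SubsetLikelihoodModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(              # per-image spatial encoder
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)        # per-frame likelihood head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        embeddings = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        aggregated, _ = self.temporal(embeddings)   # spatio-temporal aggregation
        return torch.sigmoid(self.head(aggregated)).squeeze(-1)  # (batch, time)


# One subset of 8 frames of size 64x64; outputs are likelihoods in [0, 1].
likelihoods = SubsetLikelihoodModel()(torch.rand(1, 8, 3, 64, 64))
```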
To refine the determined characteristics, a non-causal temporal convolution network may be utilized. For example, the at least one processor of the system may be configured to refine the likelihood of the characteristics in each image of the subset of images by applying a non-causal temporal convolution network. The at least one processor may be further configured to refine the likelihood of the characteristics by applying one or more signal processing techniques including low pass filtering and/or Gaussian smoothing, for example.
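A minimal sketch of such refinement, assuming per-frame likelihoods are refined with centered (non-causal) Gaussian smoothing or a simple moving-average low-pass filter, might look as follows; the filter widths are illustrative only.

```python
# Example refinement of per-frame likelihoods with non-causal (centered)
# smoothing; the filter parameters are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d

raw_likelihoods = np.array([0.1, 0.2, 0.9, 0.15, 0.85, 0.9, 0.2, 0.1])

# Gaussian smoothing uses frames both before and after each image,
# suppressing isolated spikes that are unlikely to be true detections.
refined = gaussian_filter1d(raw_likelihoods, sigma=1.0)

# A simple moving-average low-pass filter is an alternative refinement.
refined_ma = np.convolve(raw_likelihoods, np.ones(3) / 3.0, mode="same")
```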
According to a still further aspect, the at least one processor of the system may be configured to analyze the ordered set of images using the local spatio-temporal processing module to determine the presence of characteristics by determining a vector of quality scores, wherein each quality score in the vector of quality scores corresponds to each image of the subset of the images. Additionally, the at least one processor may be configured to process the ordered set of images using the global spatio-temporal processing module by refining quality scores of each image of the subset of images of the one or more subsets of the ordered set of images using signal processing techniques. The at least one processor may be further configured to analyze the one or more subsets of the ordered set of images using the local spatio-temporal processing module to determine the presence of characteristics by generating, using a deep convolutional neural network, a pixel-wise binary mask for each image of the subset of images. The at least one processor may be further configured to process the one or more subsets of the ordered set of images using the global spatio-temporal processing module by refining the binary mask for image segmentation using morphological operations exploiting prior information about the shape and distribution of the determined characteristics.
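By way of example only, refinement of a pixel-wise binary mask using morphological operations might be sketched as follows; the structuring element and the minimum-area criterion are assumptions standing in for the prior shape and distribution information described above.

```python
# Sketch of refining a pixel-wise binary segmentation mask with morphological
# operations; the structuring element size encodes an assumed prior that the
# feature of interest forms compact, roughly convex regions.
import numpy as np
from scipy import ndimage

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True        # region predicted by the segmentation network
mask[5, 5] = True                # isolated false-positive pixel

structure = np.ones((3, 3), dtype=bool)
opened = ndimage.binary_opening(mask, structure=structure)          # removes speckle
refined_mask = ndimage.binary_closing(opened, structure=structure)  # fills small gaps

# Optionally keep only connected components larger than an assumed minimum area.
labels, n = ndimage.label(refined_mask)
sizes = ndimage.sum(refined_mask, labels, index=list(range(1, n + 1)))
refined_mask = np.isin(labels, [i + 1 for i, s in enumerate(sizes) if s >= 25])
```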
As disclosed herein, implementations may include one or more of the following features. The determined likelihood of characteristics in each image of the subset of images may include a float value between 0 and 1. The quality score may be an ordinal number between 0 and R, wherein a score of 0 represents the minimum quality and a score of R represents the maximum quality. The numerical value associated with each image may be interpretable to determine the probability of identifying the at least one feature of interest within the image. The output may be a first numerical value for an image where the at least one feature of interest is not detected. The output may be a second numerical value for an image where the at least one feature of interest is detected. The size or volume of the subset of images may be configurable by a user of the system. The size or volume of the subset of images may be dynamically determined based on a requested feature of interest. The size or volume of the subset of images may be dynamically determined based on the determined characteristics. The one or more subsets of images may include shared images.
Another general aspect of the present disclosure relates to a computer-implemented system for spatio-temporal analysis of images captured with an imaging device. The computer-implemented system may comprise at least one processor configured to receive video captured from an imaging device including a plurality of image frames. The at least one processor may be further configured to: access a temporally ordered set of images from the captured images; detect, using an event detector module, an occurrence of an event in the temporally ordered set of images, wherein a start time and an end time of the event are identified by a start image frame and an end image frame in the temporally ordered set of images; select, using a frame selector module, an image from a group of images in the temporally ordered set of images, bounded by the start image frame and the end image frame, based on an associated score and a quality score of the image, wherein the associated score of the selected image indicates a presence of at least one feature of interest; merge a subset of images from the selected images based on a matching presence of the at least one feature of interest using an objects descriptor module, wherein the subset of images is identified based on spatial and temporal coherence using spatio-temporal information; and split the temporally ordered set of images into temporal intervals which satisfy the temporal coherence of a selected task.
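For illustration, the frame-selection step described above might be sketched as follows, assuming each frame carries a detection score and a quality score and that the selected image maximizes a simple weighted combination of the two; the weighting is an assumption, not a prescribed formula.

```python
# Hypothetical frame-selection step: within an event bounded by start and end
# frame indices, select the image whose combined detection and quality score
# is highest. The 0.5/0.5 weighting is an illustrative assumption.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int
    detection_score: float   # indicates presence of the feature of interest
    quality_score: float     # e.g., sharpness / visibility of the frame


def select_frame(frames: List[Frame], start: int, end: int) -> Frame:
    window = [f for f in frames if start <= f.index <= end]
    return max(window, key=lambda f: 0.5 * f.detection_score + 0.5 * f.quality_score)


frames = [Frame(i, ds, qs) for i, (ds, qs) in
          enumerate([(0.2, 0.9), (0.8, 0.4), (0.9, 0.8), (0.7, 0.9)])]
best = select_frame(frames, start=1, end=3)   # -> frame 2 in this example
```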
According to the disclosed system, the at least one processor may be further configured to determine spatio-temporal information of characteristics related to the at least one feature of interest for subsets of images of the video content using a local spatio-temporal processing module and determine the spatio-temporal information of all images of the video content using a global spatio-temporal processing module. In addition, the at least one processor may be configured to split the temporally ordered set of images into temporal intervals by identifying a subset of the temporally ordered set of images with the presence of the at least one feature of interest. The at least one processor may also be configured to identify a subset of the temporally ordered set of images with the presence of the at least one feature of interest by adding bookmarks to images in the temporally ordered set of images, wherein the bookmarked images are part of the subset of the temporally ordered set of images. Additionally, or alternatively, the at least one processor may be configured to identify a subset of the temporally ordered set of images with the presence of the at least one feature of interest by extracting a set of images from the subset of the temporally ordered set of images.
Implementations may include one or more of the following features. The extracted set of images may include characteristics related to the at least one feature of interest. The color may vary with a level of relevance of an image of the subset of temporally ordered set of images for the at least one feature of interest. The color may vary with a level of relevance of an image of the subset of temporally ordered set of images for characteristics related to the at least one feature of interest.
Another general aspect includes a computer-implemented system for performing a plurality of tasks on a set of images. The computer-implemented system may comprise at least one processor configured to receive video captured from an imaging device including a set of image frames. The at least one processor may be further configured to: receive a plurality of tasks, wherein at least one task of the plurality of tasks is associated with a request to identify at least one feature of interest in the set of images; analyze, using a local spatio-temporal processing module, a subset of images of the set of images to identify the presence of characteristics associated with the at least one feature of interest; and iterate execution of a timeseries analysis module for each task of the plurality of tasks to associate a numerical score for each task with each image of the set of images.
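A minimal sketch of iterating a timeseries analysis over a plurality of tasks is shown below; the task names, the per-task weighting, and the placeholder scoring function are assumptions used only to illustrate associating one numerical score per task with each image.

```python
# Illustrative multi-task loop: the local analysis runs once and the
# timeseries analysis is iterated per task, yielding one score per task per
# image. Task names and the scoring function are assumptions.
import numpy as np

tasks = ["polyp", "bleeding", "inflammation"]
num_images = 100
characteristics = np.random.rand(num_images, 8)   # from the local module


def timeseries_analysis(characteristics: np.ndarray, task: str) -> np.ndarray:
    # Placeholder: a real module would weight characteristics per task.
    rng = np.random.default_rng(abs(hash(task)) % (2 ** 32))
    weights = rng.random(8)
    return characteristics @ weights / weights.sum()


scores_per_task = {task: timeseries_analysis(characteristics, task) for task in tasks}
# scores_per_task["polyp"][i] is the numerical score of task "polyp" for image i.
```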
Consistent with the present disclosure, a system of one or more computers can be configured to perform operations or actions by virtue of having software, firmware, hardware, or a combination of them installed for the system that in operation causes or cause the system to perform those operations or actions. One or more computer programs can be configured to perform operations or actions by virtue of including instructions that, when executed by data processing apparatus (such as one or more processors), cause the apparatus to perform such operations or actions.
Systems and methods consistent with the present disclosure may be implemented using any suitable combination of software, firmware, and hardware. Implementations of the present disclosure may include programs or instructions that are machine constructed and/or programmed specifically for performing functions associated with the disclosed operations or actions. Still further, non-transitory computer-readable storage media may be used that store program instructions, which are executable by at least one processor to perform the steps and/or methods described herein.
It will be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments.
The following drawings which comprise a part of this specification, illustrate several embodiments of the present disclosure and, together with the description, serve to explain the principles and features of the disclosed embodiments. In the drawings:
Example embodiments are described below with reference to the accompanying drawings. The figures are not necessarily drawn to scale. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It should also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
In the following description, various working examples are provided for illustrative purposes. However, it will be appreciated that the present disclosure may be practiced without one or more of these details.
Throughout this disclosure there are references to “disclosed embodiments,” which refer to examples of inventive ideas, concepts, and/or manifestations described herein. Many related and unrelated embodiments are described throughout this disclosure. The fact that some “disclosed embodiments” are described as exhibiting a feature or characteristic does not mean that other disclosed embodiments necessarily share that feature or characteristic.
Embodiments described herein include non-transitory computer readable medium containing instructions that when executed by at least one processor, cause the at least one processor to perform a method or set of operations. Non-transitory computer readable mediums may be any medium capable of storing data in any memory in a way that may be read by any computing device with a processor to carry out methods or any other instructions stored in the memory. The non-transitory computer readable medium may be implemented as software, firmware, hardware, or any combination thereof. Software may preferably be implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine may be implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described in this disclosure may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium may be any computer readable medium except for a transitory propagating signal.
The memory may include any mechanism for storing electronic data or instructions, including Random Access Memory (RAM), Read-Only Memory (ROM), a hard disk, an optical disk, a magnetic medium, a flash memory, or other permanent, fixed, volatile, or non-volatile memory. The memory may include one or more separate storage devices, collocated or dispersed, capable of storing data structures, instructions, or any other data. The memory may further include a memory portion containing instructions for the processor to execute. The memory may also be used as a working memory device for the processors or as temporary storage.
Some embodiments may involve at least one processor. A processor may be any physical device or group of devices having electric circuitry that performs a logic operation on input or inputs. For example, the at least one processor may include one or more integrated circuits (ICs), including an application-specific integrated circuit (ASIC), microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field-programmable gate array (FPGA), server, virtual server, or other circuits suitable for executing instructions or performing logic operations. The instructions executed by at least one processor may, for example, be pre-loaded into a memory integrated with or embedded into the controller or may be stored in a separate memory.
In some embodiments, the at least one processor may include more than one processor. Each processor may have a similar construction, or the processors may be of differing constructions that are electrically connected or disconnected from each other. For example, the processors may be separate circuits or integrated in a single circuit. When more than one processor is used, the processors may be configured to operate independently or collaboratively. The processors may be coupled electrically, magnetically, optically, acoustically, mechanically, or by other means that permit them to interact.
Embodiments consistent with the present disclosure may involve a network. A network may constitute any type of physical or wireless computer networking arrangement used to exchange data. For example, a network may be the Internet, a private data network, a virtual private network using a public network, a Wi-Fi network, a LAN or WAN network, and/or other suitable connections that may enable information exchange among various components of the system. In some embodiments, a network may include one or more physical links used to exchange data, such as Ethernet, coaxial cables, twisted pair cables, fiber optics, or any other suitable physical medium for exchanging data. A network may also include one or more networks, such as a private network, a public switched telephone network (“PSTN”), the Internet, and/or a wireless cellular network. A network may be a secured network or unsecured network. In other embodiments, one or more components of the system may communicate directly through a dedicated communication network. Direct communications may use any suitable technologies, including, for example, BLUETOOTH™, BLUETOOTH LE™ (BLE), Wi-Fi, near field communications (NFC), or other suitable communication methods that provide a medium for exchanging data and/or information between separate entities.
In some embodiments, machine learning networks or algorithms may be trained using training examples, for example in the cases described below. Some non-limiting examples of such machine learning algorithms may include classification algorithms, video classification algorithms, data regression algorithms, image segmentation algorithms, temporal video segmentation algorithms, visual detection algorithms (such as object detectors, face detectors, person detectors, motion detectors, edge detectors, etc.), visual recognition algorithms (such as face recognition, person recognition, object recognition, etc.), speech recognition algorithms, action recognition algorithms, mathematical embedding algorithms, natural language processing algorithms, support vector machines, random forests, nearest neighbors algorithms, deep learning algorithms, artificial neural network algorithms, convolutional neural network algorithms, recursive neural network algorithms, linear machine learning models, non-linear machine learning models, ensemble algorithms, and so forth. For example, a trained machine learning network or algorithm may comprise an inference model, such as a predictive model, a classification model, a regression model, a clustering model, a segmentation model, an artificial neural network (such as a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, and so forth. In some examples, the training examples may include example inputs together with the desired outputs corresponding to the example inputs. Further, in some examples, training machine learning algorithms using the training examples may generate a trained machine learning algorithm, and the trained machine learning algorithm may be used to estimate outputs for inputs not included in the training examples. The training may be supervised or unsupervised, or a combination thereof. In some examples, engineers, scientists, processes, and machines that train machine learning algorithms may further use validation examples and/or test examples. For example, validation examples and/or test examples may include example inputs together with the desired outputs corresponding to the example inputs, a trained machine learning algorithm and/or an intermediately trained machine learning algorithm may be used to estimate outputs for the example inputs of the validation examples and/or test examples, the estimated outputs may be compared to the corresponding desired outputs, and the trained machine learning algorithm and/or the intermediately trained machine learning algorithm may be evaluated based on a result of the comparison. In some examples, a machine learning algorithm may have parameters and hyperparameters, where the hyperparameters are set manually by a person or automatically by a process external to the machine learning algorithm (such as a hyperparameter search algorithm), and the parameters of the machine learning algorithm are set by the machine learning algorithm according to the training examples. In some implementations, the hyperparameters are set according to the training examples and the validation examples, and the parameters are set according to the training examples and the selected hyperparameters. The machine learning networks or algorithms may be further retrained based on any output.
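By way of a simplified, non-limiting example (using a generic scikit-learn classifier rather than the networks disclosed herein), the following sketch shows training examples with desired outputs, evaluation on held-out validation examples, and a hyperparameter set externally to the learning algorithm.

```python
# Minimal illustration (not the disclosed networks) of training with example
# inputs and desired outputs, evaluating on validation examples, and setting a
# hyperparameter externally to the learning algorithm.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.random.rand(500, 16)                 # example inputs (e.g., image features)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # desired outputs for the examples

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

n_estimators = 100                          # hyperparameter set outside the learner
model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
model.fit(X_train, y_train)                 # parameters set from training examples

val_accuracy = accuracy_score(y_val, model.predict(X_val))  # validation evaluation
```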
Certain embodiments disclosed herein may include computer-implemented systems for performing operations or methods comprising a series of steps. The computer-implemented systems and methods may be implemented by one or more computing devices, which may include one or more processors as described herein, configured to process real-time video. The computing device may be one or more computers or any other devices capable of processing data. Such computing devices may include a display such as an LCD display, augmented reality (AR), or virtual reality (VR) display. However, the computing device may also be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a user device having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system and/or the computing device can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet. The computing device can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship between client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
Intelligent detector system 100 may receive as input a collection of temporally ordered images of a medical procedure, such as an endoscopy or colonoscopy procedure. Intelligent detector system 100 may output a report or information including one or more numerical value(s) (e.g., score(s)) for each image. The numerical value(s) may relate to a medical category such as a particular pathology and provide information regarding the probability of the presence of the medical category within an image frame. The images processed by intelligent detector system 100 may be images captured from a medical procedure that are stored in a database or memory device for subsequent retrieval and processing by intelligent detector system 100. In some embodiments, the output provided by intelligent detector system 100 resulting from processing the images may include a report with numerical score(s) assigned to the images and recommended next steps in accordance with medical guideline(s), for example. The report may be generated after the completion of the endoscopy or other medical procedure. The report may include information related to all features of interest identified in the processed images. Still further, in some embodiments, the output provided by intelligent detector system 100 may include recommended action(s) to be performed by the physician (e.g., performing a biopsy, removing a lesion, exploring/analyzing the surface/mucosa of an organ, etc.) in view of identified feature(s) of interest in the images from the medical procedure. During a medical procedure, intelligent detector system 100 may directly receive the video or image frames from a medical image device, process the video or image frames, and provide, during the procedure or right after the medical procedure (i.e., within a short time interval, from no time to a few minutes), feedback to the operator regarding the action(s) performed by the operator, as well as a final report containing multiple measured variables, clinical indices, and details about what was observed and/or in which anatomical location and/or how the operator behaved/acted during the medical procedure. Performed actions may include a recommended action or procedure in accordance with a medical guideline, such as performing a biopsy, removing a lesion, or exploring/analyzing a surface/mucosa of an organ. In some embodiments, a recommended action may be part of a set of recommended actions based on medical guidelines. A detailed description of an example computer system implementing intelligent detector system 100 for real-time processing is presented in
As disclosed herein, intelligent detector system 100 may generate a report after completion of a medical procedure that includes information based on the processing of the captured video by local spatio-temporal processing module 110, global spatio-temporal processing module 120, and time series analysis module 130. The report may include information related to the features of interest identified during the medical procedure, as well as other information such as numerical value(s) or score(s) for each image. As explained, the numerical value(s) may relate to a medical category such as a particular pathology and provide information regarding the probability of the presence of the medical category within an image frame. Further details regarding intelligent detector system 100 and the operations of local spatio-temporal processing module 110, global spatio-temporal processing module 120, and timeseries analysis module 130 are provided below with reference to the attached drawings.
In some embodiments, the report generated by system 100 may include additional recommended action(s) based on the processing of stored images from a medical procedure or real-time processing of images from the medical procedure. Additional recommended action(s) could include actions or procedures that could have been performed during a medical procedure and actions or procedures to be performed after the medical procedure. Additional recommended action(s) may be part of a set of recommended action(s) based on medical guidelines. Further, as described above, system 100 may process video in real-time to provide concurrent feedback to an operator about what is happening or identified in the video and during a medical procedure.
The output generated by intelligent detector system 100 may include a dashboard display or similar report (see, e.g.,
Intelligent detector system 100 may also generate reports in the form of an electronic file, a set of data, or data transmission. By way of example, the output generated by system 100 may follow a standardized format and/or be integrated into records such as electronic health records (EHR). The output of system 100 may also be compliant with regulations such as HIPAA for interoperability and privacy. In some embodiments, the output may be integrated into other reports. For example, the output of intelligent detector system 100 may be integrated into an electronic medical or health record for a patient. Intelligent detector system 100 may include an API to facilitate such integration and/or provide output in the form of a standardized data set or template. Standardized templates may include predefined forms or tables that can be filled with data values generated by intelligent detector system 100 by processing input video or image frames from a medical procedure. In some embodiments, reports may be generated by system 100 in a machine-readable format, such as an XML file, to support their transmission or storage, as well as integration with other systems and applications. In some embodiments, reports may be provided in other formats such as a Word, Excel, HTML, or PDF file format. In some embodiments, intelligent detector system 100 may upload data or a report to a server or database over a network (see, e.g., network 160 in
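As one hedged illustration of a machine-readable report, the sketch below assembles findings into an XML file using Python's standard library; the element names and fields are assumptions and do not represent a mandated or standardized schema.

```python
# Illustrative generation of a machine-readable report; the XML element names
# and fields are assumptions, not a standardized schema.
import xml.etree.ElementTree as ET

findings = [
    {"frame": 1042, "feature": "polyp", "score": 0.93},
    {"frame": 2310, "feature": "bleeding", "score": 0.78},
]

report = ET.Element("procedure_report", attrib={"procedure": "colonoscopy"})
for f in findings:
    item = ET.SubElement(report, "finding")
    ET.SubElement(item, "frame").text = str(f["frame"])
    ET.SubElement(item, "feature_of_interest").text = f["feature"]
    ET.SubElement(item, "score").text = f"{f['score']:.2f}"

ET.ElementTree(report).write("report.xml", encoding="utf-8", xml_declaration=True)
```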
As disclosed above, a feature of interest may relate to a medical category or pathology. Intelligent detector system 100 may be implemented to handle a request to detect one or more medical categories (i.e., one or more features of interest) in the images. In the case of multiple features of interest, one instance of the components of intelligent detector system 100 may be implemented for each medical category or feature of interest. As will be appreciated from this disclosure, instances of intelligent detector system 100 may be implemented with any combination of hardware, firmware, and software, according to the speed or through-put needs, volume of images to be processed, and other requirements of the system.
In some embodiments, a single instance of intelligent detector system 100 may output multiple numerical values for each image, one for each medical category. In one example embodiment, pathologies detected by intelligent detector system 100 may include polyps in the colon mucosa. Further, by way of example, intelligent detector system 100 may output a numerical value (e.g., 0) for all images among the input images where a polyp is not detected by intelligent detector system 100 and may output another numerical value (e.g., 1) for all images among the input images where the intelligent detector detects at least one polyp. In some embodiments, the numerical values can be arranged relative to a range or scale and/or indicate the probability of the presence of a polyp or other feature of interest.
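A brief illustration of this mapping, assuming a 0.5 decision threshold that is not prescribed by this disclosure, might look as follows.

```python
# Illustrative mapping of per-image polyp probabilities to the numerical
# values described above (0 when no polyp is detected, 1 when at least one
# is detected); the 0.5 decision threshold is an assumption.
probabilities = [0.02, 0.10, 0.87, 0.95, 0.30]   # detector output per image

threshold = 0.5
values = [1 if p >= threshold else 0 for p in probabilities]   # -> [0, 0, 1, 1, 0]

# Alternatively, the probability itself can be reported on a 0-to-1 scale.
report = [{"image": i, "polyp_probability": p, "detected": v}
          for i, (p, v) in enumerate(zip(probabilities, values))]
```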
The source of the input images may vary according to the imaging device, memory device, and/or needs of the application. For example, intelligent detector system 100 may be configured to process a video feed directly from a video endoscopy device and receive temporally ordered input images that are subsequently processed by the system, consistent with the embodiments disclosed herein. As a further example, intelligent detector system 100 may be configured to receive the input images from a database or memory device, the stored images being temporally ordered and previously captured using an imaging device, such as a video camera of an endoscopy device or a camera of a capsule device. Images received by intelligent detector system 100 may be processed and analyzed to identify one or more features of interest, such as one or more types of polyps or lesions.
The example system of
By way of example, intelligent detector system 100 may process a recorded video or images and provide a fully automated report and/or other output that details the feature(s) of interest observed in the processed images. Intelligent detector system 100 may use artificial intelligence or machine learning components to efficiently and accurately process the input images and make decision about the presence of features of interest based on image analysis and/or spatio-temporal information. Further, for each feature of interest that is requested or under investigation, intelligent detector system 100 can estimate its presence within the images and provide a report or other output with information indicating the likelihood of the presence of that feature and other details, such as the relative time from the beginning of the procedure or sequence of images where the feature of interest appears, estimated anatomical location, duration, most significant images, location within these images, and/or number of occurrences.
In one embodiment, intelligent detector system 100 may be configured to automatically determine the presence of gastrointestinal pathologies without the aid of a physician. As discussed above, the input images may be captured and received in different ways and using different types of imaging devices. For example, a video endoscopy device or capsule device or other medical device or other imaging device may record and provide the input images. The input images may be part of a live video feed or may be part of stored set of images received from a local or remote storage location (e.g., a local database or cloud storage). Intelligent detector system 100 may be operated as part of a procedure or service at a clinic or hospital, or it may be provided as an online or cloud service for end users to enable self-diagnostics or remote testing.
By way of example, to start an examination procedure, a user may ingest a capsule device or pill cam. The capsule device may include an imaging device and during the procedure wirelessly transmit images of the user's gastrointestinal tract to a smartphone, tablet, laptop, computer, or other device (e.g., user device 170). The captured images may then be uploaded by a network connection to a database, cloud storage or other storage device (e.g., image source 150). Intelligent detector system 100 may receive the input images from the image source and analyze the images for one or more requested feature(s) of interest (e.g., polyps or lesions). A final report may then be electronically provided as output to the user and/or their physician. The report may include a scoring or probability indicator for each observed feature of interest and/or other relevant information or medical recommendations. Additionally, intelligent detector system 100 can detect pathophysiological characteristics that are related to and an indicator of a feature of interest and score those characteristics that are determined to be present. Examples of such characteristics include bleeding, inflammation, ulceration, neoplastic tissues, etc. Further, in response to detected feature(s) of interest, the report may include information or recommendations based on medical guidelines, such as recommendations to consult with a physician and/or to take additional diagnostic examinations, for example. One or more actions may also be recommended to the physician (e.g., perform a biopsy, remove a lesion, explore/analyze the surface/mucosa of an organ, etc.) based on the analysis of the images by intelligent detector system 100 either in real-time with the medical procedure or after the medical procedure is completed.
As another example, intelligent detector system 100 could assist a physician or specialist with analyzing the video content recorded during a medical procedure or examination. The captured images may be part of the video content recorded during, for example, a gastroscopy, a colonoscopy, or an enteroscopy procedure. Based on the analysis performed by intelligent detector system 100, the full video recording could be displayed to the physician or specialist along with a colored timeline bar, where different colors correspond to different feature(s) of interest and/or scores for the identified feature(s) of interest.
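By way of illustration only, per-frame scores could be mapped to timeline colors as in the following sketch; the feature names, thresholds, and color choices are assumptions.

```python
# Sketch of building a colored timeline from per-frame scores; the color
# mapping and feature names are illustrative assumptions.
def frame_color(scores: dict) -> str:
    if scores.get("polyp", 0.0) >= 0.5:
        return "red"        # polyp detected in this frame
    if scores.get("bleeding", 0.0) >= 0.5:
        return "orange"     # bleeding detected in this frame
    return "gray"           # no requested feature of interest detected


timeline = [frame_color(s) for s in [
    {"polyp": 0.1, "bleeding": 0.0},
    {"polyp": 0.9, "bleeding": 0.1},
    {"polyp": 0.2, "bleeding": 0.7},
]]  # -> ["gray", "red", "orange"]
```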
As a still further example, a physician, specialist, or other individual could use intelligent detector system 100 to create a synopsis of the video recording or set of images by focusing on images with the desired feature(s) of interest and discarding irrelevant image frames. Intelligent detector system 100 may be configured to allow a physician or user to tune or select the feature(s) of interest for detection and the duration of each synopsis based on a total duration time and/or other parameters, such as preset preceding and trailing times before and after a sequence of frames with the selected feature(s) of interest. Intelligent detector system 100 can also be configured to combine all or the most relevant frames according to the requested feature(s) of interest.
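A minimal sketch of such a synopsis builder, assuming per-frame boolean detections for the selected feature of interest and configurable preceding and trailing context frames, might look as follows.

```python
# Hypothetical synopsis builder: keep frames where the selected feature of
# interest is detected, extended by preset preceding and trailing frames;
# the padding values are assumptions.
def synopsis_frames(detections, preceding=5, trailing=5):
    """detections: list of booleans, one per frame, for the selected feature."""
    keep = set()
    for i, detected in enumerate(detections):
        if detected:
            keep.update(range(max(0, i - preceding),
                              min(len(detections), i + trailing + 1)))
    return sorted(keep)


frames_to_keep = synopsis_frames([False] * 20 + [True] * 3 + [False] * 20)
# keeps frames 15..27: the detected frames plus the surrounding context
```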
As illustrated in
Referring again to the example embodiment of
Local spatio-temporal processing module 110 may be configured to process the whole input video in chunks by iterating over sequential batches or subsets of N image frames. Local spatio-temporal processing module 110 may also be configured to provide output that includes vectors or quality scores representing the determined characteristics of the feature(s) of interest in each image frame. In some embodiments, local spatio-temporal processing module 110 may output quality values and segmentation maps associated with each image frame. Further example details related to local spatio-temporal processing module 110 are provided below with reference to the
The subset of images processed by local spatio-temporal processing module 110 may include shared or overlapping images. Further, the size or arrangement of the subset of images may be defined or controlled based on one or more factors. For example, the size or volume of the subset of images may be configurable by a physician or other user of the system. As a further example, local spatio-temporal processing module 110 may be configured so that the size or volume of the subset of images is dynamically determined based on the requested feature(s) of interest. Additionally, or alternatively, the size of the subset of images may be dynamically determined based on the determined characteristics related to the requested feature(s) of interest.
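For illustration, overlapping subsets of N frames could be generated as sketched below; the subset size and overlap are assumptions that could equally be user-configured or derived dynamically as described above.

```python
# Illustrative chunking of the temporally ordered images into overlapping
# subsets of N frames; subset size and overlap are assumptions.
def overlapping_subsets(num_images: int, subset_size: int = 16, overlap: int = 4):
    step = subset_size - overlap
    for start in range(0, num_images, step):
        end = min(start + subset_size, num_images)
        yield list(range(start, end))        # frame indices; edges are shared
        if end == num_images:
            break


subsets = list(overlapping_subsets(40))      # consecutive subsets share 4 frames
```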
Global spatio-temporal processing module 120 may be configured to provide a global perspective by processing all subset(s) of images analyzed by local spatio-temporal processing module 110. For example, global spatio-temporal processing module 120 may process the whole input video or set of input images by processing all outputs of local spatio-temporal processing module 110 at once or together. Further, global spatio-temporal processing module 120 may be configured to provide output that includes numerical scores for each image frame by processing the vectors of determined characteristics related to the feature(s) of interests. In some embodiments, global spatio-temporal processing module 120 may process the images and vectors and output refined quality scores and segmentation maps of each image. Further example details related to global spatio-temporal processing module 120 are provided below with reference to the
Timeseries analysis module 130 uses information about images determined by local spatio-temporal processing module 110 and refined by global spatio-temporal processing module 120 to output a numerical score to indicate the presence of the one or more feature(s) of interest requested by a user of intelligent detector system 100. For example, time series analysis module 130 may be configured to use spatial and temporal information of characteristics related to the feature(s) of interest determined by local spatio-temporal processing module 110 to perform timeseries analysis on the input video or images. Further example details related to timeseries analysis module 130 are provided below with reference to the
Task manager 140 may help manage the various tasks requested by users of intelligent detector system 100. A task may relate to a requested or required feature of interest and/or characteristics of a feature of interest. One or more characteristics and features of interest may be part of each task for processing by intelligent detector system 100. Task manager 140 may help manage tasks for detections of multiple features of interest in a set of input images. Task manager 140 may determine the number of instances of components of intelligent detector system 100 (e.g., local spatio-temporal processing module 110, global spatio-temporal processing module 120, and timeseries analysis module 130). Further example details of ways of handling multiple task requests to detect features of interest are provided below with reference to the
Intelligent detector system 100 may receive input video or sets of images from image source 150 via network 160 for processing. In some embodiments, intelligent detector system 100 may receive input video directly from another system, such as a medical instrument or system used to capture video when performing a medical procedure (e.g., a colonoscopy). After processing the images, reports of detected features of interest may be shared via network 160. As disclosed herein, the reports may be transmitted electronically and take different forms, such as electronic files, displays, and data. In some embodiments, reports are sent as files to and/or displayed at user device 170. Network 160 may take various forms depending on the system needs and environment. For example, network 160 may include or utilize any combination of the Internet, a wired Wide Area Network (WAN), a wired Local Area Network (LAN), a wireless WAN (e.g., WiMAX), a wireless LAN (e.g., IEEE 802.11, etc.), a mesh network, a mobile/cellular network, an enterprise or private data network, a storage area network, a virtual private network using a public network, and/or other types of network communications. In some embodiments, network 160 may include an on-premises (e.g., LAN) network, while in other embodiments, network 160 may include a virtualized, remote, and/or cloud network (e.g., AWS™, Azure™, IBM Cloud™, etc.). Further, network 160 may in some embodiments be a hybrid on-premises and virtualized, remote, and/or cloud network, including components of one or more types of network architectures.
User device 170 may send requests to and receive output (e.g., reports or data) from intelligent detector system 100 related to feature(s) of interest in input video or images. User device 170 may control or directly provide the input video or images to intelligent detector system 100 for processing, including by way of instructions, commands, video or image set file download(s), and/or storage link(s) to storage locations (e.g., image source 150). User device 170 may comprise a smartphone, laptop, tablet, computer, and/or other computing device. User device 170 may also include an imaging device (e.g., a video or digital camera) to capture video or images for processing. In the case of capsule examination procedures, for example, user device 170 may include a pill cam or similar device that is ingested by the user and causes input video or images to be captured and streamed directly to intelligent detector system 100 or stored in image source 150 and subsequently downloaded and received by system 100 via network 160. The results of the image processing are then provided as output from intelligent detector system 100 to user device 170 via network 160.
Physician device 180 may also be used to send requests to and receive output (e.g., reports or data) from intelligent detector system 100 related to feature(s) of interest in input video or images. Similar to user device 170, physician device 180 may control or directly provide the input video or images to intelligent detector system 100 for processing, including by way of instructions, commands, video or image set file download(s), and/or storage link(s) to storage locations (e.g., image source 150). Physician device 180 may comprise a smartphone, laptop, tablet, computer, and/or other computing device. Physician device 180 may also include an imaging device (e.g., a video or digital camera) to capture video or images for processing. In the case of video endoscopy examination, for example, physician device 180 may include a colonoscopy probe or similar instrument with an imaging device that captures images during the examination of a patient. The captured video may be streamed as input video to intelligent detector system 100 or stored in image source 150 and subsequently downloaded and received by system 100 via network 160. In some embodiments, physician device 180 may receive a notification for further review of image frames with characteristics of interest. The results of the image processing are then provided as output (e.g., electronic reports or data in the form of files or digital display) from intelligent detector system 100 to physician device 180 via network 160.
Image source 150 may include a storage location or other source for input video or images to intelligent detector system 100. Image source 150 may comprise any suitable combination of hardware, software, and firmware. For example, image source 150 may include any combination of a computing device, a server, a database, a memory device, network communication hardware, and/or other devices. By way of example, image source 150 may include a database, memory, or storage (e.g., storage 220 of
In the example system of
Although embodiments of the present disclosure are described herein with general reference to medical image analysis and endoscopy, it will be appreciated that the embodiments may be applied to other medical image procedures, such as gastroscopy, colonoscopy, and enteroscopy. Further, embodiments of the present disclosure may be implemented for other image capture and analysis environments and systems, such as those for or including LIDAR, surveillance, autopiloting, and other imaging systems.
According to an aspect of the present disclosure, a computer-implemented system is provided for intelligently processing input video or sets of images and determining the presence of features of interest and characteristics related thereto. As further disclosed herein, the system (e.g., intelligent detector system 100) may include at least one memory (e.g., a ROM, RAM, local memory, network memory, etc.) configured to store instructions and at least one processor (e.g., processor(s) 230) configured to execute the instructions (see, e.g.,
As used herein, the term “image frame” or “image” refers to any digital representation of a scene or field of view captured by an imaging device. The digital representation may be encoded in any appropriate format, such as Joint Photographic Experts Group (JPEG) format, Graphics Interchange Format (GIF), bitmap format, Scalable Vector Graphics (SVG) format, Encapsulated PostScript (EPS) format, or the like. Similarly, the term “video” refers to any digital representation of a scene or area of interest comprised of a plurality of images in sequence. The digital representation of a video may be encoded in any appropriate format, such as a Moving Picture Experts Group (MPEG) format, a flash video format, an Audio Video Interleave (AVI) format, or the like. In some embodiments, the sequence of images for an input video may be paired with audio. As will be appreciated from this disclosure, embodiments of the invention are not limited to processing input video with sequenced or temporally ordered image frames but may also process streamed or stored sets of images captured in sequence or temporally ordered. Accordingly, the terms “input video” and “set(s) of images” should be considered interchangeable and do not limit the scope of the present disclosure.
As disclosed herein, an image frame or image may include representations of a feature of interest (i.e., an abnormality or other object of interest). For example, the feature of interest may comprise an abnormality on or of human tissue. In other embodiments for non-medical procedures, the feature of interest may comprise an object, such as a vehicle, person, or other entity.
In accordance with the present disclosure, an “abnormality” may include a formation on or of human tissue, a change in human tissue from one type of cell to another type of cell, and/or an absence of human tissue from a location where the human tissue is expected. For example, a tumor or other tissue growth may comprise an abnormality because more cells are present than expected. Similarly, a bruise or other change in cell type may comprise an abnormality because blood cells are present in locations outside of expected locations (that is, outside the capillaries). Similarly, a depression in human tissue may comprise an abnormality because cells are not present in an expected location, resulting in the depression.
In some embodiments, an abnormality may comprise a lesion. Lesions may comprise lesions of the gastrointestinal mucosa. Lesions may be histologically classified (e.g., per the Narrow-Band Imaging International Colorectal Endoscopic (NICE) or the Vienna classification), morphologically classified (e.g., per the Paris classification), and/or structurally classified (e.g., as serrated or not serrated). The Paris classification includes polypoid and non-polypoid lesions. Polypoid lesions may comprise protruded, pedunculated and protruded, or sessile lesions. Non-polypoid lesions may comprise superficial elevated, flat, superficial shallow depressed, or excavated lesions. With regard to detecting abnormalities as features of interest, serrated lesions may comprise sessile serrated adenomas (SSA); traditional serrated adenomas (TSA); hyperplastic polyps (HP); fibroblastic polyps (FP); or mixed polyps (MP). According to the NICE classification system, an abnormality is divided into three types, as follows: (Type 1) sessile serrated polyp or hyperplastic polyp; (Type 2) conventional adenoma; and (Type 3) cancer with deep submucosal invasion. According to the Vienna classification, an abnormality is divided into five categories, as follows: (Category 1) negative for neoplasia/dysplasia; (Category 2) indefinite for neoplasia/dysplasia; (Category 3) non-invasive low grade neoplasia (low grade adenoma/dysplasia); (Category 4) mucosal high grade neoplasia, such as high grade adenoma/dysplasia, non-invasive carcinoma (carcinoma in-situ), or suspicion of invasive carcinoma; and (Category 5) invasive neoplasia, intramucosal carcinoma, submucosal carcinoma, or the like. These examples and other types of abnormalities are within the scope of the present disclosure. It will also be appreciated that intelligent detector system 100 may be configured to detect other types of features of interest, including for medical and non-medical procedures.
In the example of
In the example of
As further depicted in
To augment the video, computing device 193 may process the video from image device 192 and create a modified video stream to send to display device 194. The modified video may comprise the original image frames with the augmenting information to be displayed to the operator via display device 194. Display device 194 may comprise any suitable display or similar hardware for displaying the video or modified video, such as an LCD, LED, or OLED display, an augmented reality display, or a virtual reality display.
As shown in
As further shown in
Processor(s) 230 may also be communicatively connected via bus or network 250 to one or more I/O device 210. I/O device 210 may include any type of input and/or output device or peripheral device, including keyboards, mice, display devices, and so on. I/O device 210 may include one or more network interface cards, APIs, data ports, and/or other components for supporting connectivity with processor(s) 230 via network 250.
As further shown in
Processor(s) 230 and/or memory 240 may also include machine-readable media for storing software or sets of instructions. “Software” as used herein refers broadly to any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by one or more processors 230, may cause the processor(s) to perform the various operations and functions described in further detail herein.
Implementations of computing device 200 are not limited to the example embodiment shown in
Sampler 311 may select the image frames for processing by other components of local spatio-temporal processing module 110. In some embodiments, sampler 311 may buffer an input video or set of images for a set period and extract image frames in the buffer as subsets of images for processing by module 110. In some embodiments, sampler 311 may allow the configuration of the number of frames or size of the image subsets to select for processing by local spatio-temporal processing module 110. For example, sampler 311 may be configured to receive user input that sets or tunes the number of frames or size of image subsets for processing. Additionally, or alternatively, sampler 311 may be configured to automatically select the number of image frames or size of image subsets based on other factors, such as the requested feature(s) of interest for processing and/or the characteristics related to the feature(s) of interest. In some embodiments, the amount or size of sampled images may be based on the frame rate of the video (e.g., 24, 30, 60, or 120 FPS). For example, sampler 311 may periodically buffer a real-time video stream received by intelligent detector system 100 for a set period to extract images from the buffered video. As a further example, a stream of images from a pill cam or other imaging device may be buffered and the images for processing by system 100 may be extracted from the buffer. In some embodiments, sampler 311 may selectively sample images based on other factors such as video quality, length of the video, characteristics related to the requested feature(s) of interest, and/or the feature(s) of interest. In some embodiments, sampler 311 may sample image frames based on components of local spatio-temporal processing module 110 involved in performing a task requested by a user of intelligent detector system 100. For example, encoder 312 using a 3D encoder network may require multiple images to create a three-dimensional structure of the content to encode.
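By way of non-limiting illustration, the following sketch shows one possible way such a sampler could buffer a frame stream and emit fixed-size subsets for downstream processing; the class name, subset size, and stride are illustrative assumptions and not part of the disclosed implementation.

```python
from collections import deque

class FrameSampler:
    """Buffers an incoming stream of frames and yields fixed-size subsets."""

    def __init__(self, subset_size=16, stride=8):
        # subset_size and stride are tunable, e.g., based on the frame rate
        # or the requested feature(s) of interest (assumed defaults).
        self.subset_size = subset_size
        self.stride = stride
        self._buffer = deque()

    def push(self, frame):
        """Add one frame; return a subset once enough frames are buffered."""
        self._buffer.append(frame)
        if len(self._buffer) >= self.subset_size:
            subset = list(self._buffer)[: self.subset_size]
            # Slide the window forward by `stride` frames.
            for _ in range(self.stride):
                if self._buffer:
                    self._buffer.popleft()
            return subset
        return None
```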
Encoder 312 may determine the presence of characteristics related to each feature of interest that is part of a task requested by a user of intelligent detector system 100. For image analysis, encoder 312 may be implemented using a trained convolutional neural network. Intelligent detector system 100 may include a 2D encoder and a 3D encoder containing non-local layers as encoder 312. Encoder 312 may be composed of multiple convolutional residual and fully connected layers. Depending on the characteristics and features of interest to be detected, encoder 312 may select a 2D or 3D convolutional encoder. Encoder 312 may be trained to detect characteristics in an image that are required to detect requested feature(s) of interest in the image frames. As disclosed herein, intelligent detector system 100 may process images and detect desirable characteristics related to the feature(s) of interest using encoder 312. Intelligent detector system 100 may determine the desirable characteristics based on the trained network of encoder 312 and past determinations of feature(s) of interest.
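For illustration only, a minimal 2D convolutional encoder with residual blocks that maps an image frame to an M-dimensional feature vector might look as follows; it is a simplified sketch written in PyTorch, and the channel counts and feature dimension are assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection around two convolutional layers.
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class Encoder2D(nn.Module):
    """Maps one image frame to an M-dimensional feature vector."""

    def __init__(self, feature_dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(ResidualBlock(64), ResidualBlock(64))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, frames):               # frames: (B, 3, H, W)
        x = self.blocks(self.stem(frames))
        x = self.pool(x).flatten(1)
        return self.fc(x)                    # (B, M) feature vectors
```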
As shown in
RNNs 313 are artificial neural networks with internal feedback connections and an internal memory status used to determine spatio-temporal information in the image frames. In some embodiments, RNNs 313 may include local layers to improve their capability of aggregating spatio-temporal information spatially and/or temporally apart in buffered image frames selected by sampler 311. RNNs 313 can be configured to associate a score, e.g., between 0 and 1, with each desirable characteristic related to a requested feature of interest. The score indicates the likelihood of the presence of the desirable characteristic in an image, with 0 being least likely and 1 being a maximum likelihood.
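A minimal illustrative sketch of such a recurrent scorer is shown below, assuming a GRU over a buffer of encoder feature vectors and a sigmoid output so that each of K characteristic scores lies in [0, 1]; all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class CharacteristicScorer(nn.Module):
    """GRU over buffered encoder features; outputs per-characteristic scores in [0, 1]."""

    def __init__(self, feature_dim=256, hidden_dim=128, num_characteristics=8):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_characteristics)

    def forward(self, features):             # features: (B, T, M)
        hidden, _ = self.rnn(features)       # (B, T, hidden_dim)
        # Sigmoid maps each characteristic score to the [0, 1] likelihood range.
        return torch.sigmoid(self.head(hidden))   # (B, T, K)
```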
As shown in
Segmentation network 316 may process the images to compute for each input image a segmentation mask to extract a segment of the image, including characteristics related to a feature of interest. A segmentation mask may be a pixel-wise binary mask with a resolution that is the same as or less than that of the input image. Segmentation network 316 may be implemented as a deep convolutional neural network including multiple convolutional residual layers and multiple skip-connections. The number of layers and type of layers included in a segmentation network may be based on the characteristics or requested feature of interest. In some embodiments, a single model with multiple layers may handle tasks of encoder 312 and segmentation network 316. For example, a single model can be a U-Net model with a ResNet encoder.
By way of example, segmentation network 316 may take an image with dimensions W×H as input and return a segmentation mask represented by a matrix with dimensions W′×H′, where W′ is less than or equal to W and H′ is less than or equal to H. Each value in the output matrix represents the probability that certain image frame coordinates contain characteristics associated with the requested feature of interest. In some embodiments, intelligent detector system 100 may produce multiple output matrices for multiple features of interest. The multiple matrices may vary in dimensions.
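For illustration only, a minimal encoder-decoder with a single skip connection that returns a per-pixel probability mask might look as follows; it is a simplified stand-in for the deeper network described above, the channel counts are assumptions, and the input height and width are assumed to be even.

```python
import torch
import torch.nn as nn

class MiniSegNet(nn.Module):
    """Minimal encoder-decoder with one skip connection; outputs a probability mask."""

    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.MaxPool2d(2),
                                   nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.out = nn.Conv2d(64, 1, 1)       # 64 = 32 upsampled + 32 skip channels

    def forward(self, image):                # image: (B, 3, H, W), H and W even
        skip = self.down1(image)             # (B, 32, H, W)
        deep = self.down2(skip)              # (B, 64, H/2, W/2)
        up = self.up(deep)                   # (B, 32, H, W)
        fused = torch.cat([up, skip], dim=1) # skip connection
        # Each output value is the probability that the pixel contains
        # a characteristic of the requested feature of interest.
        return torch.sigmoid(self.out(fused))   # (B, 1, H', W')
```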
Global spatio-temporal processing module 120 may refine the scores of determined characteristics using one or more non-causal temporal convolutional networks (TCN) 321. Global spatio-temporal processing module 120 can process the output of all images processed by local spatio-temporal processing module 110 using dilated convolutional networks included as TCNs 321. Such dilated convolutional networks help to increase the receptive field without increasing the network depth (number of layers) or the kernel size and can be used for multiple images together.
As further disclosed herein, TCNs 321 may review a whole timeseries of features K×T′ extracted using local spatio-temporal processing module 110. TCNs 321 may take as input a matrix of features with dimensions K×T′ and return one or more timeseries of scalar values of length T″.
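A simplified, non-limiting sketch of a non-causal TCN built from dilated 1D convolutions over a K×T′ feature timeseries is shown below; the number of layers, hidden width, and kernel size are assumptions, and with the symmetric padding used here the output length T″ equals T′.

```python
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    """Non-causal dilated 1D convolutions over the whole K x T' feature timeseries."""

    def __init__(self, num_features=8, hidden=32, num_layers=4):
        super().__init__()
        layers = []
        in_ch = num_features
        for i in range(num_layers):
            dilation = 2 ** i               # doubling dilation grows the receptive field
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU(inplace=True)]
            in_ch = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, series):               # series: (B, K, T')
        # Symmetric padding keeps the layers non-causal: each output
        # position sees both past and future frames.
        return self.head(self.body(series))  # (B, 1, T'')
```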
Global spatio-temporal processing module 120 may refine quality scores generated by quality network 315 using one or more signal processing techniques such as low-pass filters and Gaussian smoothing. In some embodiments, global spatio-temporal processing module 120 may refine segmentation masks generated by segmentation network 316 using a cascade of morphological operations. Global spatio-temporal processing module 120 may refine the binary masks for segmentation using prior information about the shape and distribution of the determined characteristics across input images identified by encoder 312 in combination with RNNs 313 and causal TCNs 314.
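By way of illustration, such refinement could be implemented with standard SciPy operations, as in the following sketch; the smoothing sigma and the morphological structuring element are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, binary_closing, binary_opening

def refine_quality_scores(scores, sigma=2.0):
    """Smooth a 1-D timeseries of quality scores with a Gaussian low-pass filter."""
    return gaussian_filter1d(np.asarray(scores, dtype=float), sigma=sigma)

def refine_mask(mask, structure=None):
    """Clean a binary segmentation mask with a cascade of morphological operations:
    opening removes small spurious detections, closing fills small holes."""
    mask = np.asarray(mask, dtype=bool)
    return binary_closing(binary_opening(mask, structure=structure), structure=structure)
```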
TCNs 321 may work on the complete video or sets of input images and thus must wait for local spatio-temporal processing module 110 to complete processing of individual image frames. To accommodate this requirement, TCNs 321 may be trained following the training of the networks in local spatio-temporal processing module 110. The number and architecture of layers of TCNs 321 are dependent on the task(s) requested by a user of intelligent detector system 100 to detect certain feature(s) of interest. TCNs 321 may be trained based on the requested feature(s) of interest to be tasked to the system. A training algorithm for TCNs 321 may tune parameters of TCNs 321 for each such task or feature of interest.
By way of example, intelligent detector system 100 may train TCNs 321 by first computing K-dimensional timeseries for each video in training set 415 using local spatio-temporal processing module 110 and then applying a gradient-descent based optimization to estimate the TCNs 321 parameters that minimize the loss function L(s, s′), where s is the estimated output timeseries of scores and s′ is the ground truth timeseries. Intelligent detector system 100 may calculate the distance between s and s′ using, e.g., mean squared error (MSE), cross entropy, and/or Huber loss.
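The following sketch illustrates one possible gradient-descent training loop over such precomputed timeseries, assuming a hypothetical dataset yielding (features, ground-truth score) pairs; it is a simplified example rather than the disclosed training procedure itself.

```python
import torch
import torch.nn as nn

def train_tcn(tcn, timeseries_dataset, epochs=10, lr=1e-3):
    """Gradient-descent training of a TCN on precomputed K-dimensional timeseries.

    timeseries_dataset is assumed to yield (features, target) pairs, where features
    has shape (K, T') as produced by the local spatio-temporal stage and target is
    the ground-truth score timeseries s' of matching length."""
    optimizer = torch.optim.Adam(tcn.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                   # Huber (nn.HuberLoss) is another option
    for _ in range(epochs):
        for features, target in timeseries_dataset:
            pred = tcn(features.unsqueeze(0)).squeeze(0).squeeze(0)  # (T'',)
            loss = loss_fn(pred, target)     # distance between s and s'
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return tcn
```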
Similar to training processes for other neural networks, data augmentation, learning rate tuning, label smoothing, and/or batch training may be used when training TCNs 321 to improve their capabilities and results.
Intelligent detector system 100 may be adapted to a specific requirement by tuning hyperparameters of components 110-130. In some embodiments, intelligent detector system 100 may modify the pipeline's standard architecture to process input video or sets of images by adding or removing components 110-130 or certain parts of components 110-130. For example, if intelligent detector system 100 needs to address a very local task, it may drop the usage of TCNs 321 in global spatio-temporal processing module 120 to avoid any global spatio-temporal processing of output generated by local spatio-temporal processing module 110. As another example, if a user of intelligent detector system 100 requests a diffuse task, RNNs 313 and/or TCNs 314 may be removed from local spatio-temporal processing module 110. Intelligent detector system 100 may remove global spatio-temporal processing module 120 or some RNNs 313 and TCNs 314 of local spatio-temporal processing module 110 by deactivating the relevant networks from the pipeline used for processing input video or images to detect requested pathologies.
Other arrangements or implementations of the system are also possible. For example, segmentation network 316 may be unnecessary and dropped from the pipeline when the requested task for detecting features of interest does not deal with focal objects in image frames of the images. As another example, quality network 315 may be unnecessary and deactivated when all image frames are regarded as useful. For example, when the frame rate of the input video is low or the video already contains many image frames with errors, intelligent detector system 100 may avoid further filtering of image frames by quality network 315. As will be appreciated from this disclosure, intelligent detector system 100 may pre-process and/or sample input video or images to determine the components that need to be active and trained as part of local spatio-temporal processing module 110 and global spatio-temporal processing module 120.
Event detector 331 may determine the start and stop times in an input video of an event associated with a requested feature of interest. In some embodiments, event detector 331 determines the start and end image frame in an input video of events associated with a requested feature of interest. In some embodiments, the start and stop times or image frames of events may overlap.
The start and stop times of the events may be the beginning and end of portions of the input video where some of the characteristics related to a feature of interest are detected. The start and stop times of the video may be estimates due to image frames missing from the analysis by local spatio-temporal processing module 110. Event detector 331 may output a list of pairs (t, d), where t is a time instant and d is the description of the event detected at that time. Various events may be identified based on different features of interest processed by intelligent detector system 100.
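For illustration, a simple thresholding routine that converts a per-frame score timeseries into such (t, d) pairs might look as follows; the threshold, frame rate, and event description are assumptions and stand in for the richer event logic described above.

```python
def detect_events(scores, threshold=0.5, frame_rate=30.0, description="feature of interest"):
    """Convert a per-frame score timeseries into (t, d) event pairs.

    Emits one pair at the estimated start and one at the estimated stop of each
    contiguous run of frames whose score meets or exceeds the threshold."""
    events, inside = [], False
    for i, s in enumerate(scores):
        if s >= threshold and not inside:
            inside = True
            events.append((i / frame_rate, f"start: {description}"))
        elif s < threshold and inside:
            inside = False
            events.append((i / frame_rate, f"stop: {description}"))
    if inside:                                # event still open at the end of the video
        events.append((len(scores) / frame_rate, f"stop: {description}"))
    return events
```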
Portions of the input video identified from events may include portions of an organ scanned by a healthcare professional or other operator to generate input video as part of a medical procedure. For example, a medical procedure such as a colonoscopy may include events configured for various portions of a colon, such as ascending colon, transverse colon, or descending colon.
Timeseries analysis module 130 may provide a summary report of the events of different portions of the video that represent different portions of a medical procedure. The summary report may include, for example, the length of video or time taken to complete a scan of a portion of the organ associated with the event, which may be listed as a withdrawal time. Event detector 331 may help generate the summary report of different portions of medical procedure that include events related to the features of interest.
Timeseries analysis module 130 may present summary report(s) of different portions of a medical procedure video (e.g., a colonoscopy video) on a dashboard or other display showing, e.g., pie charts with the amount of video requiring different actions, such as a careful second review, performing a biopsy, or removing a lesion, on the portion of the video or the portion of the organ represented by the video portion. In some embodiments, the dashboard may include quality summary details of events identified by event detector 331 in a color-coded manner. For example, the dashboard may include red, orange, and green colored buttons or other icons to identify the quality of video of a portion of a medical procedure representing an event. The dashboard may also include summary details of the overall video representing the whole medical procedure, with the same level of information as that provided for individual portions of the medical procedure.
In some embodiments, the summary report generated by timeseries analysis module 130 may identify one or more frames for reviewing portion(s) more carefully and/or addressing other issues. The summary report may also indicate the percentage of the video on which to conduct additional operations, such as a second review. Timeseries analysis module 130 may use frame selector 332 to identify the specific frames of the video, or the percentage of the video, on which to conduct additional operations.
Frame selector 332 may retrieve image frames in the input video based on the characteristics and scores generated by local spatio-temporal processing module 110. In some embodiments, frame selector 332 may also utilize the user provided quality values to select image frames. Frame selector 332 may select image frames based on their relevance to characteristics and/or features of interest requested by a user of intelligent detector system 100.
In some embodiments, the summary report generated by timeseries analysis module 130 may include one or more image frames identified by frame selector 332. An image frame presented in the report may be augmented to display marking(s) applied to one or more portions of the frame. In some embodiments, markings may identify a feature of interest such as a lesion or polyp in an image frame. For example, a colored bounding box may be used as a marking surrounding the feature of interest (see, e.g.,
Objects descriptor 333 may merge image frames of an input video that include matching characteristics from the requested features of interest. Objects descriptor 333 merges image frames based on temporal and spatial coherence information provided by local spatio-temporal processing module 110. The output of objects descriptor 333 may include a set of objects described using sets of properties. Property sets may include a timestamp of image frames relative to other image frames of the input video. In some embodiments, property sets may include statistics on estimated scores and locations of detected characteristics or requested features of interest in image frames.
Temporal segmentor 334 splits an input video into temporal intervals. Temporal segmentor 334 may split the video based on temporal coherence with respect to the task of determining the requested features of interest. Temporal segmentor 334 may output a label for each image frame of the input video in the form {L_i}. The output labels may indicate the presence and probability of a requested feature of interest in an image frame and its position within the image frame. In some embodiments, temporal segmentor 334 may output separate labels for each feature of interest in each image frame.
In some embodiments, timeseries analysis module 130 may generate a dashboard or other display including quality scores for a medical procedure performed by a physician, healthcare professional, or other operator. To provide the quality scores, timeseries analysis module 130 may include machine learning models that are trained based on videos of the medical procedure performed by other physicians and operators with different examination performance behaviors. Among other things, the machine learning models may be trained to recognize video segments during which the examination behavior of the healthcare professional indicates the need for additional review. For example, time an endoscopist spends carefully exploring the colon or small bowel surface, as opposed to time spent cleaning it, performing surgeries, or navigating, may indicate a need for additional review of the small bowel surface. Machine learning models used by timeseries analysis module 130 may learn about a particular activity of a healthcare professional, such as careful exploration, based on the amount of time spent, the number of pictures taken, and/or the number of repeated scans of a certain section of a medical procedure representing a certain portion of an organ. In some embodiments, the machine learning models may learn about healthcare professional behavior based on the amount of markings, in the form of notes or flags, added to the video or to certain areas of an image frame in a video.
In some embodiments, timeseries analysis module 130 may generate a summary report of quality scores of healthcare professional behavior using information about the time spent performing certain actions (e.g., careful exploration, navigating, cleaning, etc.). In some embodiments, the percentage of the total time of the medical procedure spent on a certain action may be used to calculate the quality score of the medical procedure or a portion of the medical procedure. Timeseries analysis module 130 may be configured to generate a quality summary report of healthcare professional behavior based on the configuration of intelligent detector system 100 to include actions performed by the healthcare professional as features of interest.
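As a non-limiting sketch, such a quality score could be computed from the fraction of procedure time spent on each action, as below; the action labels and weights are illustrative assumptions, not values prescribed by this disclosure.

```python
def procedure_quality_score(action_durations, weights=None):
    """Score a procedure (or a portion of it) from the time spent on each action.

    action_durations: dict mapping an action label to seconds spent on it,
    e.g., {"careful exploration": 420, "navigating": 180, "cleaning": 60}.
    weights: how much each action contributes to quality (assumed values)."""
    weights = weights or {"careful exploration": 1.0, "navigating": 0.2, "cleaning": 0.1}
    total = sum(action_durations.values())
    if total == 0:
        return 0.0
    # Weighted fraction of the total procedure time, normalized to [0, 1].
    return sum(weights.get(action, 0.0) * duration
               for action, duration in action_durations.items()) / total
```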
To generate a dashboard with the summary scores described above, timeseries analysis module 130 may utilize event detector 331, frame selector 332, object descriptor 333, and temporal segmentor 334 in combination. The dashboard may include one or more frame(s) from the medical procedure that are selected by frame selector 332 and information regarding the total time spent on the medical procedure and the time spent examining portions with pathologies or other features of interest. The quality score summary of statistics describing healthcare professional behavior may be computed for the whole medical procedure (e.g., whole colon scan) and/or for portion(s) identifying an anatomical region (e.g., colon segments such as ascending colon, transverse colon, and descending colon).
Timeseries analysis module 130 may use event detector 331, frame selector 332, object descriptor 333, and temporal segmentor 334 to generate aggregate information about different features of interest, such as different regions of an organ captured during a medical procedure, the presence of each pathology, and/or actions of a healthcare professional performing the medical procedure. For example, aggregate information may be generated based on a listing of the various pathologies in different regions using object descriptor 333, frame(s) showing a pathology selected by frame selector 332, and an identified amount of time spent in the region of each pathology and the healthcare professional actions determined by event detector 331.
In some embodiments, timeseries analysis module 130 may generate a summary of input video processed by local spatio-temporal processing module 110 and global spatio-temporal processing module 120. The summary of the input video may include segments of the input video extracted and combined into a summary containing the features of interest. In some embodiments, the user can choose whether to view only the summary video or to expand each of the intervals of the video that were discarded by the module. Temporal segmentor 334 of timeseries analysis module 130 may help extract portions of input video with features of interest. In some embodiments, timeseries analysis module 130 may generate a video summary by selecting relevant frames to generate a variable frame rate video output. Frame selector 332 of timeseries analysis module 130 may aid in the selection and dropping of frames in an output video summary. In some embodiments, timeseries analysis module 130 may provide additional metadata to the input video or a video summary. For example, timeseries analysis module 130 may color code the timeline of an input video where features of interest are present. Timeseries analysis module 130 may use different colors to highlight a timeline with different features of interest. In some embodiments, the portions of the output video summary with features of interest may include text and graphics overlaid on the output video summary.
In some embodiments, to maximize performance, modules 110-130 may be trained to select optimal parameter values for the neural networks in each of the modules 110-130.
The components of local spatio-temporal processing module 110 shown in
During the training process, intelligent detector system 100 may sample, from the training dataset, images or a buffer of images to be processed by the neural networks in components 110-130 of intelligent detector system 100 and update their parameters by error backpropagation. Intelligent detector system 100 may control the convergence of the ground truth value y′ of a desirable characteristic and the encoder 312 output value y using validation set 416 of a video set. Intelligent detector system 100 may use test set 417 of a video set to assess the performance of encoder 312 in determining values of characteristics in image frames of a training subset of a video set. Intelligent detector system 100 may continue to train until the ground truth value y′ converges with the output value y. Upon reaching convergence, intelligent detector system 100 may complete the training procedure and remove the temporary fully connected network. Intelligent detector system 100 finalizes encoder 312 with the latest values of the parameters.
In some embodiments, the temporary network may be decoder network 412 used by intelligent detector system 100 to train encoder 312. Decoder network 412 may be a convolutional neural network that maps each feature vector estimated by encoder 312 to a large matrix (I_out) of the same dimensions as an image frame (I_in) of an input video. Decoder network 412 may use L(I_out, I_in) as loss function 413 to compute the distance between two images (or buffers of N images). Loss function 413 used with decoder network 412 may include mean squared error (MSE), structural similarity (SSIM), or L1 norm. Decoder network 412 used as a temporary network to train encoder 312 does not require the determination of ground truth values for the training/validation/testing subsets 415-417 of a video set. When training encoder 312 using decoder network 412 as the temporary network, intelligent detector system 100 may control convergence with validation set 416 and use test set 417 to assess the expected performance of encoder 312. Intelligent detector system 100 may drop or deactivate decoder network 412 after completing encoder 312 training.
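The following sketch illustrates this self-supervised pretraining, assuming hypothetical encoder and decoder modules and a hypothetical frame loader; MSE is used here for L(I_out, I_in), with L1 or SSIM-based losses as alternatives.

```python
import torch
import torch.nn as nn

def pretrain_encoder(encoder, decoder, frames_loader, epochs=5, lr=1e-3):
    """Self-supervised pretraining: the decoder reconstructs the input frame from
    the encoder's feature vector, so no ground-truth labels are needed.

    frames_loader is assumed to yield batches of images I_in with shape (B, 3, H, W);
    decoder is assumed to map a (B, M) feature vector back to (B, 3, H, W)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()                   # L1 or SSIM-based losses are alternatives
    for _ in range(epochs):
        for frames in frames_loader:
            recon = decoder(encoder(frames))          # I_out
            loss = loss_fn(recon, frames)             # L(I_out, I_in)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # After convergence the decoder is dropped; only the encoder is kept.
    return encoder
```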
In both training methods using fully connected network 411 and decoder network 412, encoder 312 and other components of intelligent detector system 100 may use techniques such as data augmentation, learning rate tuning, label smoothing, mosaic, MixUp, and CutMix data augmentation, and/or batch training to improve the training process of encoder 312. In some embodiments, neural networks in intelligent detector system 100 may suffer from class imbalance and may use ad-hoc weighted loss functions and importance sampling to avoid a prediction bias for the majority class.
Intelligent detector system 100 may train RNNs 313 and TCNs 314 using the output of a previously trained encoder 312. The input to RNNs 313 and TCNs 314 may be an M-dimensional feature vector per time instant output by encoder 312. RNNs 313 and TCNs 314 aggregate multiple feature vectors generated by encoder 312 by buffering them. Intelligent detector system 100 may train RNNs 313 and TCNs 314 by feeding a sequence of consecutive image frames to encoder 312 and passing the generated feature vectors to RNNs 313 and TCNs 314. For a sequence of B images (or buffered sets of images), encoder 312 produces B vectors of M encoded features, which are sent to RNNs 313 or TCNs 314 to produce B vectors of K features.
Intelligent detector system 100 may train RNNs 313 and TCNs 314 by including a temporary fully connected network (FCN) 411 at the end of RNNs 313 and TCNs 314. FCN 411 converts the K-dimensional feature vector generated by RNNs 313 or TCNs 314 to a one-dimensional score and compares it against the ground truth in a loss function, revising parameters until the output converges with the ground truth. In some embodiments, intelligent detector system 100 improves RNNs 313 and TCNs 314 by using data augmentation, learning rate tuning, label smoothing, batch training, weighted sampling, and/or importance sampling as part of training RNNs 313 and TCNs 314.
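A simplified sketch of such a temporary head and training loop is shown below, assuming a frozen, previously trained encoder, a hypothetical data loader of labeled clips, and a temporal network that maps (B, T, M) encoder features to (B, T, K) characteristic vectors; the shapes and the binary cross-entropy loss are assumptions.

```python
import torch
import torch.nn as nn

class TemporaryHead(nn.Module):
    """Temporary fully connected head mapping a K-dimensional feature vector to one score."""

    def __init__(self, k_dim=8):
        super().__init__()
        self.fc = nn.Linear(k_dim, 1)

    def forward(self, k_features):            # (B, T, K)
        return torch.sigmoid(self.fc(k_features)).squeeze(-1)  # (B, T)

def train_temporal_network(encoder, temporal_net, head, clips_loader, epochs=5, lr=1e-3):
    """Trains the RNN/TCN on sequences of encoder features; the encoder stays frozen."""
    optimizer = torch.optim.Adam(
        list(temporal_net.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for frames, targets in clips_loader:   # frames: (B, T, 3, H, W), targets: (B, T)
            b, t = frames.shape[:2]
            with torch.no_grad():              # encoder was trained in the previous stage
                feats = encoder(frames.flatten(0, 1)).reshape(b, t, -1)   # (B, T, M)
            scores = head(temporal_net(feats)) # (B, T) one-dimensional scores
            loss = loss_fn(scores, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return temporal_net                        # the temporary head is discarded afterwards
```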
Intelligent detector system 100 may train segmentation network 316 using one or more individual images or a small buffer of size N. The buffer size N may be based on the number of images considered by encoder 312 trained in
In some embodiments, intelligent detector system 100 may use data augmentation, such as ad-hoc morphological operations and affine transformations with each image frame in input video and mask generated for each image frame, learning rate tuning, label smoothing, and/or batch training to improve results of segmentation network 316.
Temporal convolution networks (TCNs) 321 of global spatio-temporal processing module 120 may access the whole timeseries of features T′×K extracted by local spatio-temporal processing module 110 working on T′ image frames to generate feature vectors of 1×K dimension. Global spatio-temporal processing module 120 takes the whole matrix T′×K of features as input and returns a timeseries of scalar values of length T″. Intelligent detector system 100 may train global spatio-temporal processing module 120 by training TCNs 321.
Training of TCNs 321 by intelligent detector system 100 and, in turn, of global spatio-temporal processing module 120 may consider the number of processing layers of TCNs 321 and their architectural structure. The number of layers and connections varies based on the task for determining features of interest and needs to be tuned for each task.
Intelligent detector system 100 trains global spatio-temporal processing module 120 by computing K-dimensional timeseries of scores for image frames of each video in training set 415. Intelligent detector system 100 computes the timeseries scores by providing training set 415 videos as input to the previously trained local spatio-temporal processing module 110 and its output to global spatio-temporal processing module 120. Intelligent detector system 100 may use gradient-descent based optimization to estimate the network parameters of the TCNs 321 neural networks. Gradient-descent based optimization can minimize the distance between the timeseries scores s output by global spatio-temporal processing module 120 and the ground truth timeseries scores s′. Loss function 413 used to train global spatio-temporal processing module 120 can be mean squared error (MSE), cross entropy, or Huber loss.
In some embodiments, intelligent detector system 100 may use data augmentation, learning rate tuning, label smoothing, and/or batch training techniques to improve results of trained global spatio-temporal processing module 120.
As illustrated in
Local spatio-temporal processing module 110 may output a K×T′ matrix 531 of characteristic scores. T′ is the number of frames of input video 501 iteratively analyzed by local spatio-temporal processing module 110. Local spatio-temporal processing module 110 generates a vector of size K of characteristic scores for each analyzed frame of the T′ frames. The size K may match the number of features of interest requested by a user of intelligent detector system 100. Local spatio-temporal processing module 110 may process input video 501 using sampler 311 to retrieve some or all of the T image frames. The number of T′ frames analyzed by the components of local spatio-temporal processing module 110 to generate characteristic scores may be less than or equal to the total T image frames of input video 501. Sampler 311 may select T′ frames for analysis by other components 312 and 315-317. In some embodiments, RNNs 313 and TCNs 314 may generate scores for only T′ image frames of the sampled frames. Networks 313-314 may include T′ image frames based on the presence of at least one characteristic of the requested features of interest. Local spatio-temporal processing module 110 uses only one set of networks 313 or 314 to process image frames and generate matrix 531 of characteristic scores.
Local spatio-temporal processing module 110 generates the matrix 531 of characteristic scores for T′ image frames by reviewing each image frame individually or in combination with a subset of image frames of input video 501 buffered and provided by sampler 311.
Local spatio-temporal processing module 110 may generate additional matrices 532-534 of scores using networks 315-317. Quality network 315 may generate a quality score of each image frame considered by sampler 311 for determining characteristics related to the features of interest in each image frame. As illustrated in
Segmentation network 316 may generate matrix 533 of segmentation masks by processing T′″ image frames of input video 501. Matrix 533 has dimensions W′×H′×T′″ and includes T′″ masks of height H′ and width W′. In some embodiments, the width W′ and height H′ of the segmentation mask may be less than the dimensions of a processed image frame. Segmentation network 316 may analyze image frames extracted by sampler 311 to generate segmentation masks for T′″ image frames. In some embodiments, T′″ may be less than the total number of frames T of input video 501. Segmentation network 316 may process with a segmentation mask only those T′″ frames that include at least some of the characteristics or requested features of interest.
As illustrated in
Global spatio-temporal processing module 120 may use post-processors 523-525 to refine matrices 532-534 of additional scores and details used in determining requested features of interest to generate matrices 542-544.
By way of example, post-processor 523 refines quality scores matrix 532 using one or more standard signal processing techniques such as low-pass filters and Gaussian smoothing. Post-processor 523 outputs matrix 542 of dimension 1×U″ of refined scores. In some embodiments, value U″ may be different from value T″. For example, U″ may be less than T″ if certain image frames of low quality score were ignored by post-processor 523. Alternatively, U″ may be more than T″ when video 501 is upsampled to generate more image frames and image frames with higher resolution.
Post-processor 524 may refine segmentation masks matrix 533 using a cascade of morphological operations exploiting prior information about the shape and distribution of each feature of interest. Post-processor 524 may output matrix 543 of dimension W′×H′×U′″. In some embodiments, the dimension U′″ may be different from T′″. For example, U′″ may be less than T′″ if certain image frames of low quality score were ignored by post-processor 524. Alternatively, U′″ may be more than T′″ when video 501 is upsampled to generate more image frames and image frames with higher resolution.
As illustrated in
As illustrated in
Task manager 140 may maintain separate pipelines for each task and train them independently. As illustrated in
Sampler 311 and quality network 315 may rely on input image data and work on image data in the same manner irrespective of the requested features of interest. Accordingly, in pipeline 650, the components of local spatio-temporal processing module 630 that depend only on the input data and are unrelated to the requested task, namely sampler 631 and quality network 635, are shared between tasks 602 and 603 processing input video 601. Pipeline 650 can share their output between multiple tasks, to be processed by downstream components in pipeline 650.
Encoder 312 may depend on the requested task to identify the right annotations for image frames of input video 601, but it depends more on the input data and can also be shared between different tasks. Accordingly, pipeline 650 may share encoder 632 among tasks 602 and 603. Further, sharing encoder 632 across tasks may improve its training due to the larger number of samples available across multiple tasks.
Quality network 315 directly works on the quality of the image without relying on the requested tasks. Thus, using separate instances of quality network 315, one per task, is redundant, as the quality score of an image frame in an input video (e.g., input video 601) has no relation to the requested task (e.g., tasks 602 and 603), and separate instances would result in the same operation being applied multiple times to input video 601.
Segmentation network 316 is more dependent on a requested task than the above-discussed components. However, it can still be shared as it is easier to generate multiple outputs for different tasks (e.g., tasks 602 and 603). As illustrated in
Neural networks 633-634 may include either instances of RNNs 313 or TCNs 314 that generate matrices of characteristic scores specific to the requested features of interest to identify in different tasks. Local spatio-temporal processing module 630 of pipeline 650 may be configured to generate multiple copies of encoder outputs 637 and 638 and provide them as input to multiple neural networks 633 and 634, one per task.
As illustrated in
In step 710, intelligent detector system 100 may receive an input video or ordered set of images over network 160. As disclosed herein, the images to be processed may be temporally ordered. Intelligent detector system 100 may request images directly from image source 150. In some embodiments, other external devices such as physician device 180 and user device 170 may direct intelligent detector system 100 to request images from image source 150. In some embodiments, user device 170 may submit a request to detect features of interest in images currently streamed or otherwise received by image source 150.
In step 720, intelligent detector system 100 may analyze subsets of images individually to determine characteristics related to each requested feature of interest. Intelligent detector system 100 may use sampler 311 (as shown in
Intelligent detector system 100 may allow configuration of the number of images to include in a subset of images, as disclosed herein. Intelligent detector system 100 may automatically configure the size of the subset based on the requested features of interest or characteristics related thereto. In some embodiments, the subset size may be configured based on input from a user or physician (e.g., through user device 170 or physician device 180 of
Intelligent detector system 100 may analyze the subset of images using local spatio-temporal processing module 110 to determine the likelihood of characteristics in each image of the subset of images. The likelihood of characteristics related to each feature of interest may be represented by a range of continuous or discrete values. For example, the likelihood of characteristics may be represented using a value ranging between 0 and 1.
Intelligent detector system 100 may detect characteristics by encoding each image of a subset of images using encoder 312. As part of the analysis process, intelligent detector system 100 may aggregate spatio-temporal information of the determined characteristics using a recurrent neural network (e.g., RNN(s) 313 as shown in
Intelligent detector system 100 may determine additional information about each image using quality network 315 (as shown in
In some embodiments, intelligent detector system 100 may generate additional information regarding characteristics using segmentation network 316. The additional information may include information on portions of interest within each image. Intelligent detector system 100 may use segmentation network 316 to extract portions of the image with requested features of interest by generating segmentation masks for each image of a subset of images. Segmentation network 316 may use a deep convolutional neural network to extract the image portions.
In step 730, intelligent detector system 100 may process vectors of information about images and the determined characteristics of images in step 720. Intelligent detector system 100 may use global spatio-temporal processing module 120 to process output generated by local spatio-temporal processing module 110 in step 720. Intelligent detector system 100 may process vectors of information associated with all images together to refine vectors of information, including characteristics determined in each image. Global spatio-temporal processing module 120 may apply a non-causal temporal convolution network (e.g., Temporal Convolution Network(s) 321 of
Intelligent detector system 100 may also refine vectors with additional information about images and characteristics such as quality scores and segmentation masks using post-processors (e.g., post-processor 322 as shown in
For example, as shown in
In some embodiments, intelligent detector system 100 may refine segmentation masks used for image segmentation for extracting portions of each image containing requested features of interest using post-processors (e.g., post-processor 322 as shown in
In step 740, intelligent detector system 100 may associate a numerical value with each image based on the refined characteristics for each image of the ordered set of images in step 730. Components of intelligent detector system 100 may interpret the assigned numerical value of each image to determine the probability of identifying a feature of interest within each image. Intelligent detector system 100 may present different numerical values to indicate different states of each requested feature of interest. For example, intelligent detector system 100 may output a first numerical value for each image where a requested feature of interest is detected and output a second numerical value for each image where the requested feature of interest is not detected.
In some embodiments, intelligent detector system 100 may interpret the associated numerical value to determine a position in an image where a characteristic of a requested feature of interest is present or the number of images that include a characteristic. Following step 740, intelligent detector system 100 may generate a report (step 750) with information on each feature of interest based on the numerical values associated with each image. As disclosed above, the report may be presented electronically in different forms (e.g., a file, a display, a data transmission, and so on) and may include information about the presence of each requested feature of interest as well as additional information and/or recommendations based on, for example, medical guidelines. Intelligent detector system 100, upon completion of step 750, completes the process (step 799) and execution of method 700 on, for example, computing device 200.
In step 810, intelligent detector system 100 may access a temporally ordered set of images of video content over network 160 (as shown in
In step 820, intelligent detector system 100 may detect an occurrence of an event in the temporally ordered set of images using spatio-temporal information of characteristics in each image of the ordered set of images. Intelligent detector system 100 may detect events using event detector 331 (as shown in
Intelligent detector system 100, upon detection of an event, may add color to a portion of a timeline of the video content that matches the subset of the temporally ordered set of images of the video content where the event was discovered.
The color may vary with the level of relevance of an image of the subset of the temporally ordered set of images for one or more characteristics related to a feature of interest.
Intelligent detector system 100 may use the determined spatio-temporal information of characteristics to determine in a temporally ordered set of images where an event representing an occurrence of a feature of interest is present.
In step 830, intelligent detector system 100 may select an image from groups of images using frame selector 332 (as shown based on
In step 840, intelligent detector system 100 may merge subsets of images with matching characteristics based on spatial and temporal coherence using object descriptor 333. Intelligent detector system 100 may determine spatial and temporal coherence of characteristics using the spatio-temporal information of characteristics in each image determined in step 820.
In step 850, intelligent detector system 100 may split the temporally ordered set of images into subsets satisfying temporal coherence of the selected tasks using temporal segmentor 334 (as shown in
Intelligent detector system 100 may extract a clip of the video content matching one of the split subsets of the temporally ordered set of images of the video. The extracted clips may include at least one feature of interest. Intelligent detector system 100, upon completion of step 850, completes (step 899) execution of method 800 on computing device 200.
In step 910, intelligent detector system 100 may receive a plurality of tasks (e.g., tasks 602 and 603 of
In step 920, intelligent detector system 100 may analyze a subset of images using local spatio-temporal processing module 110 (as shown in
In some embodiments, intelligent detector system 100 may use global spatio-temporal analysis module 120 to refine characteristics identified by local spatio-temporal processing module 110 by filtering incorrectly identified characteristics. In some embodiments, global spatio-temporal processing module 120 may highlight and flag some characteristics identified by local spatio-temporal processing module 110. In some embodiments, global spatio-temporal processing module 120 may filter using additional components such as quality network 315 and segmentation network 316 applied once against the set of images to generate additional information about the input images.
In step 930, intelligent detector system 100 may iteratively execute timeseries analysis module 130 for each task of the requested set of tasks to associate a numerical score with each image of the input set of images. In some embodiments, intelligent detector system 100 may include multiple instances of timeseries analysis module 130 to process multiple tasks simultaneously. For example, timeseries modules 671 and 672 (as shown in
The diagrams and components in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer hardware or software products according to various example embodiments of the present disclosure. For example, each block in a flowchart or diagram may represent a module, segment, or portion of code, which includes one or more executable instructions for implementing the specified logical functions. It should also be understood that in some alternative implementations, functions indicated in a block may occur out of the order noted in the figures. By way of example, two blocks or steps shown in succession may be executed or implemented substantially concurrently, or two blocks or steps may sometimes be executed in reverse order, depending upon the functionality involved. Furthermore, some blocks or steps may be omitted. It should also be understood that each block or step of the diagrams, and combination of the blocks or steps, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. Computer program products (e.g., software or program instructions) may also be implemented based on the described embodiments and illustrated examples.
It should be appreciated that the above-described systems and methods may be varied in many ways and that different features may be combined in different ways. In particular, not all the features shown above in a particular embodiment or implementation are necessary in every embodiment or implementation. Further combinations of the above features and implementations are also considered to be within the scope of the herein disclosed embodiments or implementations.
While certain embodiments and features of implementations have been described and illustrated herein, modifications, substitutions, changes and equivalents will be apparent to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes that fall within the scope of the disclosed embodiments and features of the illustrated implementations. It should also be understood that the herein described embodiments have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the systems and/or methods described herein may be implemented in any combination, except mutually exclusive combinations. By way of example, the implementations described herein can include various combinations and/or sub-combinations of the functions, components and/or features of the different embodiments described.
Moreover, while illustrative embodiments have been described herein, the scope of the present disclosure includes embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations or alterations based on the embodiments disclosed herein. Further, elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described herein or during the prosecution of the present application. Instead, these examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples herein be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
Number | Date | Country | Kind
---|---|---|---
22183987.1 | Jul 2022 | EP | regional

Number | Date | Country
---|---|---
63368025 | Jul 2022 | US