It is estimated that over 250 million people in developing regions around the world suffer from permanent vision impairment, and that 80% of these cases would have been preventable or curable had they been detected early. Nevertheless, preventable eye diseases such as diabetic retinopathy, macular degeneration, and glaucoma continue to proliferate, causing permanent vision loss over time when they are left undiagnosed.
Current diagnostic methods are time-consuming and expensive, requiring trained clinicians to manually examine and evaluate digital color photographs of the retina, thereby leaving many patients undiagnosed and susceptible to vision loss over time. Therefore, there is an ongoing need for solutions enabling the early detection of retinal diseases to provide patients with timely access to life-altering diagnostics without dependence on medical specialists in clinical settings.
Objects, features, and advantages of the disclosure will be readily apparent from the following description of certain embodiments taken in conjunction with the accompanying drawings.
Disclosed herein are methods and apparatuses that circumvent the need for a clinician in diagnosing vision-degenerative diseases. The methods use an automated classifier to distinguish healthy and pathological fundus (retinal) images. The disclosed embodiments provide an all-purpose solution for vision-degenerative disease detection, and the excellent results attained indicate the high efficacy of the disclosed methods in providing efficient, low-cost eye diagnostics without dependence on clinicians.
Automated solutions to current diagnostic shortfalls have been investigated in the past, but they suffer from major drawbacks that hinder transferability to clinical settings. For example, most automated algorithms derive predictive potential from small datasets of around 300 fundus images taken in isolated, singular clinical environments. These algorithms are unable to generalize to real-world applications where image acquisition will encounter artifacts, brightness variations, and other perturbations.
Based upon the need for a highly discriminative algorithm for distinguishing healthy fundus images from pathological ones, the disclosed method uses state-of-the-art deep learning algorithms. Deep learning algorithms are known to work well in computer vision applications, especially when training on large, varied datasets. By applying deep learning to a large-scale fundus image dataset representing a heterogeneous cohort of patients, the disclosed methods are capable of learning discriminative features.
In some embodiments, a method is capable of detecting symptomatic pathologies in the retina from a fundus scan. The method may be implemented on any type of device with computational capability, such as a laptop or smartphone. The method utilizes state-of-the-art deep learning methods for large-scale automated feature learning to represent each input image. These features are normalized and compressed using computational techniques, for example, kernel principal component analysis (PCA), and they are fed into multiple second-level gradient-boosting decision trees to generate a final diagnosis. In some embodiments, the method reaches 95% sensitivity and 98% specificity with an area under the receiver operating characteristic curve (AUROC) of 0.97, thus demonstrating high clinical applicability for automated early detection of vision-degenerative diseases.
The disclosed methods and apparatuses have a number of advantages. First, they place diagnostics in the hands of the people, eliminating dependence on clinicians. Individuals or technicians may use the methods disclosed herein, and devices on which those methods run, to achieve objective, independent diagnoses. Second, they reduce unnecessary workload on clinicians in medical settings; rather than spending time trying to diagnose potentially diseased patients out of a demographic of millions, clinicians can attend to patients already determined to be at high risk for a vision loss disease, thereby focusing on providing actual treatment in a time-efficient manner.
In some embodiments, a large set of fundus images representing a variety of eye conditions (e.g., healthy, diseased) is processed using deep learning techniques to determine a function, F(x). The function F(x) may then be provided to an application on a computational device (e.g., a computer (laptop, desktop, etc.) or a mobile device (smartphone, tablet, etc.)), which may be used in the field to diagnose patients' eye diseases. In some embodiments, the computational device is a portable device that is fitted with hardware to enable the portable device to take a fundus image of the eye of a patient who is being tested for eye diseases, and then the application on the portable device processes this fundus image using the function F(x) to determine a diagnosis. In other embodiments, a previously-taken fundus image of the patient's eye is provided to a computational device (e.g., any computational device, whether portable or not), and the application processes the fundus image using the function F(x) to determine a diagnosis.
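By way of illustration only, the following is a minimal sketch (in Python, which one embodiment uses) of how a previously learned function F(x) might be applied to a single fundus image on a computational device. The helper names, image size, and preprocessing details are hypothetical and are not prescribed by this disclosure.

```python
# Minimal sketch (not the disclosed implementation): applying a previously
# learned diagnostic function F(x) to one fundus image on a device.
import numpy as np
from PIL import Image

def preprocess(path, size=(512, 512)):
    """Load a fundus photograph and scale pixel values to [0, 1] (illustrative)."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def diagnose(path, F, threshold=0.5):
    """Apply the learned function F(x) and threshold its risk score."""
    x = preprocess(path)
    score = float(F(x))  # e.g., a continuous risk descriptor in [0, 1]
    return ("diseased" if score >= threshold else "healthy"), score
```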
Preferably, the images in the dataset represent a heterogeneous cohort of patients with a multitude of retinal afflictions indicative of various ophthalmic diseases, such as, for example, diabetic retinopathy, macular edema, glaucoma, and age-related macular degeneration. Each of the input images in the dataset has been pre-associated with a diagnostic label of “healthy” or “diseased.” For example, the diagnostic labels may have been determined by a panel of medical specialists. These diagnostic labels may be any convenient labels, including alphanumeric characters. For example, the labels may be numerical, such as a value of 0 or 1, where 0 is healthy and 1 is diseased, or a value, possibly non-integer, within a range between a minimum value and a maximum value (e.g., the range [0-5], which is simply one example) to represent a continuous risk descriptor. Alternatively or in addition, the labels may include letters or other indicators.
At block 100, representing a method 100, the dataset of fundus images is processed in accordance with the novel methods disclosed herein, discussed in more detail below. At block 40, the function F(x), which may be used thereafter to diagnose vision-degenerative diseases as described in more detail below, is provided as an output.
At block 120, the images in the dataset are optionally pre-filtered. If used, pre-filtering may have the benefit of encoding robust invariances into the method, or it may enhance the final accuracy of the model. In some embodiments using pre-filtering, each image is rotated by some random number of degrees (or radians) using any computer randomizing technique (e.g., by using a pseudo-random number generator to choose the number of degrees/radians by which each image is rotated). In some embodiments that include block 120, each image is randomly flipped horizontally (e.g., by randomly selecting a value of 0 or 1, where 0 (or 1) means to flip the image horizontally and 1 (or 0) means not to flip the image horizontally). In some embodiments that include block 120, each image is randomly flipped vertically (e.g., by randomly selecting a value of 0 or 1, where 0 (or 1) means to flip the image vertically and 1 (or 0) means not to flip the image vertically). In some embodiments that include block 120, each image is skewed using conventional image processing techniques in order to account for real-world artifacts and brightness fluctuations that may arise during image acquisition with a smartphone camera. The examples provided herein are some of the many transformations that may be used to pre-filter each image, artificially augmenting the original dataset with variations of perturbed images. Other pre-filtering techniques may also or alternatively be used in the optional block 120, and the examples given herein are not intended to be limiting.
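By way of non-limiting example, the following sketch shows one possible pre-filtering routine corresponding to the transformations described above (random rotation, random horizontal and vertical flips, and a small random skew). It assumes the Pillow imaging library, which this disclosure does not require, and the parameter ranges are illustrative.

```python
# Illustrative pre-filtering (data augmentation) sketch using Pillow.
import random
from PIL import Image

def prefilter(img: Image.Image) -> Image.Image:
    # Random rotation by some number of degrees chosen with a pseudo-random generator.
    img = img.rotate(random.uniform(0, 360), fillcolor=(0, 0, 0))
    # Random horizontal flip (0/1 coin toss).
    if random.randint(0, 1):
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    # Random vertical flip (0/1 coin toss).
    if random.randint(0, 1):
        img = img.transpose(Image.FLIP_TOP_BOTTOM)
    # Small random skew to mimic real-world acquisition artifacts.
    shear = random.uniform(-0.1, 0.1)
    return img.transform(img.size, Image.AFFINE, (1, shear, 0, 0, 1, 0))

# Each original image may be passed through prefilter() several times to
# artificially augment the dataset with perturbed variants.
```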
Referring again to
For example, there may be N functions, denoted as g1, g2, g3, . . . , gN, connected in a chain to form g(x)=gN( . . . g3(g2(g1(x))) . . . ), where g1 is called the first layer of the network, g2 is called the second layer, and so on. The depth of the model is the number of functions in the chain. Thus, for the example given above, the depth of the model is N. The final layer of the network is called the output layer, and the other layers are called hidden layers. The learning algorithm decides how to use the hidden layers to produce the desired output. Each hidden layer of the network is typically vector-valued. The width of the model is determined by the dimensionality of the hidden layers.
In some embodiments, the input is presented at the layer known as the “visible layer.” A series of hidden layers then extracts features from an input image. These layers are “hidden” because the model determines which concepts are useful for explaining relationships in the observed data.
In some embodiments, a custom deep convolutional network uses the principle of “residual learning,” which introduces identity connections between convolutional layers so that each layer incrementally learns a residual with respect to the underlying target function. This may aid in final accuracy, but residual learning is optional; any variation of a neural network (preferably a convolutional neural network of sufficient depth for enhanced learning power) may be used in the embodiments disclosed herein.
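For illustration only, the following is a minimal sketch of a single residual block with an identity connection, assuming a PyTorch-style framework; the disclosure does not mandate any particular framework, and the layer sizes shown are arbitrary.

```python
# Illustrative residual block with an identity (shortcut) connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # The block learns a residual that is added back to its input
        # through the identity connection.
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)
```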
In some embodiments, the custom deep convolutional network contains many hidden layers and millions of parameters. For example, in one embodiment, the network has 26 hidden layers with a total of 6 million parameters. The deep learning textbook entitled “Deep Learning,” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press), available online at http://www.deeplearningbook.org, provides information about how neural networks and convolutional variants work and is hereby incorporated by reference.
Referring again to
At blocks 150A and 150B, the extracted features are optionally normalized and/or compressed. In some embodiments including one or both of blocks 150A and 150B, the features from both layer A and layer B are normalized. For example, in embodiments using the last convolutional layer and the global average pool layer, the features may be normalized using L2 normalization to restrict the values to between [0,1]. If used, the normalization may be achieved by any normalization technique, L2 normalization being just one non-limiting example. As indicated in
In addition or alternatively, at blocks 150A and 150B, the features may be compressed. For example, a kernel PCA function may optionally be used (e.g., on the features from the last convolutional layer) to map the feature vector to a smaller number of features (e.g., 1034 features) in order to enhance feature correlation before decision tree classification. The use of a PCA function may improve accuracy. Kernel PCA is just one of many compression algorithms that may be used to map a large number of features to a smaller number of features; any compression algorithm may alternatively be used (e.g., independent component analysis (ICA), non-kernel PCA, etc.). As indicated in
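The following non-limiting sketch illustrates the optional normalization and compression steps using scikit-learn, with L2 normalization and kernel PCA as the examples named above; the component count shown is merely illustrative.

```python
# Illustrative normalization and compression of extracted feature vectors.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import KernelPCA

def normalize_and_compress(features: np.ndarray, n_components: int = 1034) -> np.ndarray:
    """features: (n_images, n_raw_features) array taken from a network layer."""
    # L2-normalize each feature vector (non-negative ReLU features then lie in [0, 1]).
    features = normalize(features, norm="l2")
    # Kernel PCA maps the vectors to a smaller number of components;
    # n_components must not exceed the number of samples.
    kpca = KernelPCA(n_components=n_components, kernel="rbf")
    return kpca.fit_transform(features)
```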
As illustrated in
One type of independent feature generation is statistical feature extraction. Thus, in addition to automated feature extraction with deep learning, thorough statistical feature extraction using conventional filters in the computer vision field may be used at optional block 180. In contrast to the fine-tuned filters learned through the deep learning training process, conventional filters may optionally be used for supplemental feature extraction to enhance accuracy. For example, statistical feature extraction may be performed using any of Riesz features, co-occurrence matrix features, skewness, kurtosis, and/or entropy statistics. In experiments performed by the inventor using Riesz features, co-occurrence matrix features, skewness, kurtosis, and entropy statistics, these features formed a final feature vector of 56 features. It is to be understood that there are many ways to perform statistical feature extraction, and the examples given herein are provided to illustrate but not limit the scope of the claims.
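As a non-limiting illustration of such statistical feature extraction, the sketch below computes co-occurrence-matrix (GLCM) features together with skewness, kurtosis, and entropy statistics using scikit-image and SciPy; Riesz features are omitted for brevity, and the exact feature set in any embodiment may differ.

```python
# Illustrative statistical feature extraction from a grayscale fundus image.
import numpy as np
from scipy.stats import skew, kurtosis, entropy
from skimage.feature import graycomatrix, graycoprops

def statistical_features(gray_image: np.ndarray) -> np.ndarray:
    """gray_image: 2-D uint8 array (a grayscale fundus image)."""
    # Gray-level co-occurrence matrix and a few standard texture properties.
    glcm = graycomatrix(gray_image, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    glcm_feats = [graycoprops(glcm, p).mean()
                  for p in ("contrast", "homogeneity", "energy", "correlation")]
    # First-order statistics of the pixel intensity distribution.
    pixels = gray_image.ravel().astype(np.float64)
    hist, _ = np.histogram(pixels, bins=256, range=(0, 255), density=True)
    stats = [skew(pixels), kurtosis(pixels), entropy(hist + 1e-12)]
    return np.asarray(glcm_feats + stats)
```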
Another type of independent feature generation is handcrafted feature extraction. Thus, in addition to automated feature extraction with deep learning, handcrafted feature extraction may be utilized to describe an image. One may handcraft filters (as opposed to those automatically generated within the layers of deep learning) to specifically generate feature vectors that represent targeted phenomena in the image (e.g., a micro-aneurysm (a blood leakage from a vessel), an exudate (a plaque leakage from a vessel), a hemorrhage (large blood pooling out of a vessel), the blood vessel itself, etc.). It is to be understood that there are many ways to perform handcrafted feature extraction (for example, constructing Gabor filter banks), and the examples given herein are provided to illustrate but not limit the scope of the claims. A person having ordinary skill in the art would understand the use of conventional methods of image discrimination and representation that would be useful in the scope of the disclosures herein.
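As one non-limiting illustration of a handcrafted filter bank, the sketch below summarizes Gabor filter responses, which tend to respond to oriented, vessel-like structures; it assumes scikit-image, and the frequencies and orientations shown are arbitrary.

```python
# Illustrative handcrafted feature extraction using a Gabor filter bank.
import numpy as np
from skimage.filters import gabor

def gabor_features(gray_image: np.ndarray) -> np.ndarray:
    """Summarize filter responses over several frequencies and orientations."""
    feats = []
    for frequency in (0.1, 0.2, 0.4):
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            real, _imag = gabor(gray_image, frequency=frequency, theta=theta)
            feats.extend([real.mean(), real.var()])
    # 3 frequencies x 4 orientations x 2 statistics = 24 values.
    return np.asarray(feats)
```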
The features extracted from the neural network (e.g., directly from blocks 140A and 140B, or from optional blocks 150A and 150B, if present) and, if present, the independently generated features from optional block 180 are combined and fed into optional block 160, which concatenates the feature vectors. In some embodiments, the concatenated features, which form a long numerical vector (or multiple long numerical vectors), are then provided to a gradient-boosting classifier, with the original diagnostic label serving as the training label.
At block 170, feature vectors are mapped to output labels. In some embodiments, the output labels are numerical in the form of the defined diagnostic label (e.g., 0 or 1, continuous variable between a minimum value and a maximum value (e.g., 0 to 5), etc.). This may be interpreted in many ways, such as by thresholding at various levels to optimize metrics such as sensitivity, specificity, etc. In some embodiments, thresholding at 0.5 with a single numerical output may provide adequate accuracy.
In some embodiments, the feature vectors are mapped to output labels by performing gradient-boosting decision-tree classification. In some such embodiments, separate gradient-boosting classifiers are optionally used on each bag of features. Gradient-boosting classifiers are tree-based classifiers known for capturing fine-grained correlations in input features based on intrinsic tree ensembles and bagging. In some embodiments, the prediction from each classifier is weighted using standard grid search to generate a final diagnosis score. Grid search is a way for computers to determine optimal parameters. Grid search is optional but may improve accuracy. The use of gradient-boosting classifiers is also optional; any supervised learning algorithm that can map feature vectors to an output label may work, such as Support Vector Machine classification or Random Forest classification. Gradient-boosting classifiers may, however, have better accuracy than other candidate approaches. A person having ordinary skill in the art would understand the use of conventional methods to map feature vectors to corresponding labels that would be useful in the scope of the disclosures herein.
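By way of illustration, the following sketch shows one way this second-level classification stage might be implemented with scikit-learn: a separate gradient-boosting classifier per bag of features, tuned by grid search, with the per-bag predictions combined by a weighted average. The parameter grid and weights are hypothetical, not values required by the disclosure.

```python
# Illustrative second-level classification with gradient boosting and grid search.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def fit_bag_classifier(X: np.ndarray, y: np.ndarray) -> GridSearchCV:
    """X: (n_images, n_features) for one bag of features; y: 0/1 diagnostic labels."""
    grid = {"n_estimators": [100, 300], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]}
    search = GridSearchCV(GradientBoostingClassifier(), grid, scoring="roc_auc", cv=3)
    return search.fit(X, y)

def final_score(bag_models, bag_features, weights) -> np.ndarray:
    """Weighted combination of per-bag predicted probabilities of disease."""
    probs = [m.predict_proba(X)[:, 1] for m, X in zip(bag_models, bag_features)]
    return np.average(np.vstack(probs), axis=0, weights=weights)
```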
The output labels from blocks 160A and 160B are then provided to block 40 of
It is to be understood that
As another example, instead of processing both the illustrated left-hand branch (blocks 140A and 150A) and the right-hand branch (blocks 140B and 150B), one of the branches may be eliminated altogether. Furthermore, although
At block 230, the fundus image of the patient's eye is processed using the function F(x). For example, an app on a smartphone may process the fundus image of the patient's eye. At block 240, a diagnosis is provided as output. For example, the app on the smartphone may provide a diagnosis that indicates whether the analysis of the fundus image of the patient's eye suggests that the patient is suffering from a vision-degenerative disease.
An embodiment of the disclosed method has been tested using five-fold stratified cross-validation, preserving the percentage of samples of each class per fold. This testing procedure split the training data into five buckets of around 20,500 images. The method trained on four folds and predicted the labels of the remaining one, repeating this process once per fold. This process ensured model validity independent of the specific partition of training data used.
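For illustration, the following sketch shows a five-fold stratified cross-validation loop of the kind described above, assuming scikit-learn; train_and_predict is a hypothetical stand-in for the full feature-extraction-and-classification pipeline.

```python
# Illustrative five-fold stratified cross-validation loop.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(images: np.ndarray, labels: np.ndarray, train_and_predict):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    fold_predictions = []
    for train_idx, test_idx in skf.split(images, labels):
        # Train on four folds and predict the held-out fold; the stratified
        # split preserves the class balance in every fold.
        preds = train_and_predict(images[train_idx], labels[train_idx], images[test_idx])
        fold_predictions.append((test_idx, preds))
    return fold_predictions
```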
The implemented embodiment derived average metrics from five test runs by comparing the embodiment's predictions to the gold standard determined by a panel of specialists. Two metrics were chosen to validate the embodiment:
Area Under the Receiver Operating Characteristic (AUROC) curve: The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier by measuring the tradeoff between its true positive and false positive rates. The closer the area under this curve is to 1, the smaller the tradeoff, indicating greater predictive potential. The implemented embodiment scored an average AUROC of 0.97 during 5-fold cross-validation. This is a near-perfect result, indicating excellent performance on a large-scale dataset.
Sensitivity and Specificity: Sensitivity indicates the rate of true positives among all actual positive cases, whereas specificity measures the rate of true negatives among all actual negative cases. As indicated by Table 1 below, the implemented embodiment achieved an average 95% sensitivity and a 98% specificity during 5-fold cross-validation. This operating point lies near the top of the ROC curve, with minimal tradeoff between sensitivity and specificity.
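By way of illustration, the metrics reported above can be computed from held-out predictions as sketched below, assuming scikit-learn; y_true denotes the gold-standard labels and y_score the continuous diagnosis scores.

```python
# Illustrative computation of AUROC, sensitivity, and specificity.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def summarize(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5):
    auroc = roc_auc_score(y_true, y_score)
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {"AUROC": auroc, "sensitivity": sensitivity, "specificity": specificity}
```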
To validate the efficiency of the residual network in learning highly discriminative filters for optimal feature map generation,
To validate the prognostic performance of the implemented embodiment, an occlusion heatmap was generated on sample pathological fundus images. This heatmap was generated by iteratively occluding parts of an input image and highlighting regions of the image that greatly impact the diagnostic output in red while highlighting irrelevant regions in blue.
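The following non-limiting sketch outlines one way such an occlusion heatmap can be computed: a patch is slid across the image, the diagnosis score is recomputed for each occluded copy, and the resulting drop in score marks the regions that drive the prediction. F, the patch size, and the stride are illustrative assumptions.

```python
# Illustrative occlusion-based heatmap generation.
import numpy as np

def occlusion_heatmap(image: np.ndarray, F, patch: int = 32, stride: int = 16) -> np.ndarray:
    """image: (H, W, 3) float array in [0, 1]; returns an (H, W) importance map."""
    baseline = float(F(image))
    heat = np.zeros(image.shape[:2], dtype=np.float32)
    counts = np.zeros(image.shape[:2], dtype=np.float32)
    for y in range(0, image.shape[0] - patch + 1, stride):
        for x in range(0, image.shape[1] - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0.0   # blank out one patch
            drop = baseline - float(F(occluded))        # large drop => important region
            heat[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1.0
    return heat / np.maximum(counts, 1.0)
```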
As explained previously, the disclosed methods may be implemented in a portable apparatus. For example, an apparatus for vision-degenerative disease detection comprises an external lens attached to a smartphone that implements the disclosed method. The smartphone may include an application that implements the disclosed method. The apparatus provides rapid, portable screening for vision-degenerative diseases, greatly expanding access to eye diagnostics in rural regions that would otherwise lack basic eye care. Individuals are no longer required to seek out expensive medical attention each time they wish for a retinal evaluation, and can instead simply use the disclosed apparatus for efficient evaluation.
The efficacy of the disclosed method was tested in an iOS smartphone application built using Swift and used in conjunction with a lens attached to the smartphone. This implementation of one embodiment was efficient in diagnosing input retinal scans, taking on average approximately 10 seconds to generate a diagnosis; when tested in real time on an iPhone 5, the application produced a diagnosis in approximately 8 seconds.
For proper clinical application, further testing and optimization of the sensitivity metric may be necessary in order to ensure minimal false-negative rates. To further increase the sensitivity metric, it may be important to control for specific variances in the dataset, such as ethnicity or age, to optimize the algorithm for certain demographics during deployment.
The disclosed method may be implemented on a computer programmed to execute a set of machine-executable instructions. In some embodiments, the machine-executable instructions are generated from computer code written in the Python programming language, although any suitable computer programming language may be used instead.
As shown in
Output devices may include, for example, a visual output device, an audio output device, and/or a tactile output device (e.g., vibrations, etc.). Input devices may include, for example, an alphanumeric input device, such as a keyboard including alphanumeric and other keys, for enabling a user to communicate information and command selections to the microprocessor 1103. Input devices may include, for example, a cursor control device, such as a mouse, a trackball, stylus, cursor direction keys, or touch screen, for communicating direction information and command selections to the microprocessor 1103, and for controlling movement on the display & display controller 1108.
The I/O devices 1110 may also include a network device for accessing other nodes of a distributed system via the communication network 116. The network device may include any of a number of commercially available networking peripheral devices such as those used for coupling to an Ethernet, token ring, Internet, or wide area network, personal area network, wireless network, or other method of accessing other devices. The network device may further be a null-modem connection, or any other mechanism that provides connectivity to the outside world.
The volatile RAM 1105 may be implemented as dynamic RAM (DRAM), which requires power continually in order to refresh or maintain the data in the memory. The non-volatile memory 1106 may be a magnetic hard drive or a magnetic optical drive or an optical drive or a DVD RAM or other type of memory system that maintains data even after power is removed from the system. Typically, the non-volatile memory will also be a random access memory, although this is not required. Although
It will be apparent from this description that aspects of the method 100 may be embodied, at least in part, in software. That is, the techniques may be carried out in a computer system or other data processing system in response to its processor, such as a microprocessor, executing sequences of instructions contained in a memory, such as ROM 1107, volatile RAM 1105, non-volatile memory 1106, cache 1104 or a remote storage device. In various embodiments, hard-wired circuitry may be used in combination with software instructions to implement the method 100. Thus, the techniques are not limited to any specific combination of hardware circuitry and software, nor to any particular source for the instructions executed by the data processing system. In addition, various functions and operations may be performed by or caused by software code, and therefore the functions and operations result from execution of the code by a processor, such as the microprocessor 1103.
A non-transitory machine-readable medium can be used to store software and data (e.g., machine-executable instructions) that, when executed by a data processing system (e.g., at least one processor), causes the system to perform various methods disclosed herein. This executable software and data may be stored in various places including for example ROM 1107, volatile RAM 1105, non-volatile memory 1106 and/or cache 1104. Portions of this software and/or data may be stored in any one of these storage devices.
Thus, a machine-readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, mobile device, personal digital assistant, manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-readable medium includes recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.).
Note that any or all of the components of this system illustrated in
In the foregoing description and in the accompanying drawings, specific terminology has been set forth to provide a thorough understanding of the disclosed embodiments. In some instances, the terminology or drawings may imply specific details that are not required to practice the disclosed embodiments.
To avoid obscuring the present disclosure unnecessarily, well-known components (e.g., memory) are shown in block diagram form and/or are not discussed in detail or, in some cases, at all.
Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification and drawings and meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc. As set forth explicitly herein, some terms may not comport with their ordinary or customary meanings.
As used in the specification and the appended claims, the singular forms “a,” “an” and “the” do not exclude plural referents unless otherwise specified. The word “or” is to be interpreted as inclusive unless otherwise specified. Thus, the phrase “A or B” is to be interpreted as meaning all of the following: “both A and B,” “A but not B,” and “B but not A.” Any use of “and/or” herein does not mean that the word “or” alone connotes exclusivity.
As used herein, phrases of the form “at least one of A, B, and C,” “at least one of A, B, or C,” “one or more of A, B, or C,” and “one or more of A, B and C” are interchangeable, and each encompasses all of the following meanings: “A only,” “B only,” “C only,” “A and B but not C,” “A and C but not B,” “B and C but not A,” and “all of A, B, and C.”
To the extent that the terms “include(s),” “having,” “has,” “with,” and variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising,” i.e., meaning “including but not limited to.” The terms “exemplary” and “embodiment” are used to express examples, not preferences or requirements.
Although specific embodiments have been disclosed, it will be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, features or aspects of any of the embodiments may be applied, at least where practicable, in combination with any other of the embodiments or in place of counterpart features or aspects thereof. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
This application claims the benefit of, and hereby incorporates by reference the contents of, U.S. Provisional Application No. 62/383,333, filed Sep. 2, 2016 and entitled “SCREENING METHOD FOR AUTOMATED DETECTION OF VISION-DEGENERATIVE DISEASES FROM COLOR FUNDUS IMAGES.”