The present disclosure relates to assessing ultrasound measurement data using a machine learning model.
It is well known to use ultrasound to obtain information about structures inside the human or animal body. The information may include quantitative measurements for diagnostic use. For example, fetal brain biometry measurements to assess fetal growth can be performed, such as estimation of the head circumference (HC) and trans-cerebellar diameter (TCD). Such measurements may be performed by manipulating an ultrasound probe until an optimal ultrasound imaging plane is observed by a user of the probe. The parameter of interest is then obtained from the optimal imaging plane. This visual assessment of planes is time-consuming, subjective, and requires significant training to perform well. The approach presents particular challenges where ultrasound image quality is sub-optimal, such as where portable and/or low-cost ultrasound probes are used.
It is an object of the invention to at least partially address one or more of the problems with the prior art discussed above and/or other problems.
According to an aspect of the invention, there is provided a computer-implemented method of training a machine learning model to assess ultrasound measurement data, the method comprising: (a) receiving training data comprising a plurality of classified frames of ultrasound measurement data, each of at least a subset of the classified frames being classified as representing an imaging plane capable of providing information about a respective target anatomical feature corresponding to the classified frame class; (b) selecting from the plurality of classified frames a first sample of frames and a second sample of frames; (c) using the machine learning model to derive a prototype feature vector for each of one or more target anatomical features, each prototype feature vector being derived from feature vectors obtained by inputting to the machine learning model frames from the first sample that belong to a classified frame class corresponding to a respective one of the target anatomical features; (d) using the machine learning model to derive a feature vector for each of the frames in the second sample; (e) calculating metrics representing respective distances, in an embedded space of the feature vectors, between each of the feature vectors derived in (d) and each of the prototype feature vectors derived in (c); and (f) iteratively modifying parameters of the machine learning model and repeating (b)-(e) to optimize a loss function that is a function of the metrics calculated in (e).
The method provides a trained machine learning model that is lightweight and computationally efficient. The model can assess frames of ultrasound data representing different imaging planes, and output numerical values (quality metrics) that indicate how suitable each frame is for extracting information of interest. The model can be implemented on modest computational hardware (e.g. on a mobile device such as a tablet or smart phone) and provide near-real-time feedback to an operator. The method is demonstrated to be effective even where relatively inexpensive ultrasound hardware is used. The approach can therefore be deployed in a wider range of settings than alternative approaches relying on expensive ultrasound probes or high-powered data processing.
According to a further aspect of the invention, there is provided a computer-implemented method of assessing ultrasound measurement data, comprising: providing a machine learning model trained using the method of training a machine learning model of any disclosed embodiment; receiving input data comprising a plurality of input frames of ultrasound measurement data, each input frame corresponding to a different imaging plane of ultrasound measurement data; and using the trained machine learning model to generate a quality metric for each of the input frames, the quality metric quantifying a relative capacity of the input frame to provide information about a respective one of the target anatomical features.
According to a further aspect of the invention, there is provided a computer-implemented method of assessing ultrasound measurement data, comprising: training a machine learning model using the method of training a machine learning model of any disclosed embodiment; receiving input data comprising a plurality of input frames of ultrasound measurement data, each input frame corresponding to a different imaging plane of ultrasound measurement data; and using the trained machine learning model to generate a quality metric for each of the input frames, the quality metric quantifying a relative capacity of the input frame to provide information about a respective one of the target anatomical features.
According to a further aspect of the invention, there is provided a method of determining information about an anatomical feature, comprising: performing ultrasound measurements on a subject to obtain a plurality of input frames of ultrasound measurement data, each input frame corresponding to a different imaging plane of ultrasound measurement data; and using a machine learning model trained using the method of training a machine learning model of any disclosed embodiment to generate a quality metric for each of the input frames, the quality metric quantifying a relative capacity of the input frame to provide information about a respective one of the target anatomical features.
According to a further aspect of the invention, there is provided an ultrasound system, comprising: an ultrasound probe; and a data processing system configured to perform the method of assessing ultrasound measurement data of any disclosed embodiment to assess ultrasound measurement data obtained by the ultrasound probe.
Embodiments of the disclosure will be further described by way of example only with reference to the accompanying drawings.
Various embodiments of the disclosure relate to methods that are computer-implemented. Each step of the disclosed methods may be performed by a computer in the most general sense of the term, meaning any device capable of performing the data processing steps of the method, including dedicated digital circuits. The computer may comprise various combinations of known computer elements, including for example CPUs, RAM, SSDs, motherboards, network connections, firmware, software, and/or other elements known in the art that allow the computer to perform the required computing operations. The required computing operations may be defined by one or more computer programs. The one or more computer programs may be provided in the form of media or data carriers, optionally non-transitory media, storing computer readable instructions. When the computer readable instructions are read by the computer, the computer performs the required method steps. The computer may consist of a self-contained unit, such as a general-purpose desktop computer, laptop, tablet, mobile telephone, or other smart device. Alternatively, the computer may consist of a distributed computing system having plural different computers connected to each other via a network such as the internet or an intranet.
Embodiments of the disclosure concern training and use of a machine learning model that can improve ultrasound measurements by providing an automated assessment of the quality of ultrasound measurement data, in particular on a per-imaging-plane basis.
In step S1, the method comprises receiving training data. The training data comprises a plurality of classified frames of ultrasound measurement data. Each classified frame represents a single imaging plane. The classification of the classified frames indicates, for each of at least a subset of the classified frames, that the respective imaging plane is capable of providing information about a respective target anatomical feature (e.g. at a predetermined quality threshold or above). Each target anatomical feature may thus correspond to a respective one of the classified frame classes. The plurality of classified frames may comprise one or more classified frames that are classified according to a further class characteristic. The further class characteristic may be a characteristic other than being capable of providing information about a target anatomical feature (at a predetermined quality threshold or above). For example, the further class characteristic may be that the imaging plane represents background signal only (e.g. signal that is not capable of providing information about any target anatomical feature at a predetermined quality threshold or above). The classified frames may be referred to as labelled frames or labelled images, where the label represents the classification (e.g. which of the target anatomical features the classified frame can provide information about, or that the classified frame represents background only).
In some embodiments, the target anatomical features may include at least two target anatomical features. In some embodiments, the at least two target anatomical features comprise fetal head circumference (HC) and trans-cerebellar diameter (TCD). In such embodiments, the classified frames may correspond to imaging planes where information about HC or TCD is obtainable/visible. Various approaches may be used to quantify how suitable a given frame is for providing information about the target anatomical feature of interest. For example, a score may be calculated for each frame that represents how suitable that frame is for providing the information of interest. The frames may then be classified according to whether the score corresponding to a particular target anatomical feature is higher or lower than a predetermined threshold. For example, if a frame has a score that is higher than a respective predetermined threshold for TCD and lower than a respective predetermined threshold for HC, that frame can be classified as a TCD-suitable frame. If a frame has a score that is higher than the respective predetermined threshold for HC and lower than the respective predetermined threshold for TCD, that frame can be classified as an HC-suitable frame. If a frame has a score that is lower than the thresholds for all of the target anatomical features (e.g. HC and TCD), that frame may be classified as a background frame (i.e. not capable of providing information about any of the target anatomical features at a predetermined quality threshold or above).
In one embodiment, the following clinical criteria for scoring frames were used. Frames suitable for assessing TCD were scored from 1 to 9 based on the following factors: the relevant feature in the frame being horizontal scores 1; 30% or more magnification scores 1; symmetrical hemispheres score 1; cavum septum pellucidum (CSP) scores 2 if clear, 1 if suspected; thalami score 2 if clear, 1 if suspected; cerebellar edge scores 2 if clear, 1 if unclear. Frames suitable for assessing HC were scored from 1 to 9 based on the following factors: the relevant feature in the frame being horizontal scores 1; 30% or more magnification scores 1; symmetrical hemispheres score 1; CSP scores 2 if clear, 1 if suspected; thalami score 2 if clear, 1 if suspected; no cerebellum visible scores 1; an oval HC scores 1. In the detailed example described below, two experienced sonographers annotated and scored only three frames per video (a TCD frame, an HC frame and a background frame). Frames having a score of 6 or more were deemed to be capable of providing information about the target anatomical feature of interest (TCD or HC) and were classified accordingly. Frames having lower scores were classified as background.
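For illustration only, the TCD scoring rubric above may be expressed in code. The following Python sketch assumes a simple annotation record; the data structure and field names are assumptions made for this example and are not prescribed by the disclosure (the HC rubric can be encoded analogously).

```python
# Illustrative sketch of the 1-9 clinical scoring rubric for candidate TCD
# frames. The dataclass and field names are assumptions for this example.
from dataclasses import dataclass

@dataclass
class TCDFrameAnnotation:
    horizontal: bool           # relevant feature lies horizontally
    magnified_30pct: bool      # 30% or more magnification
    symmetric_hemispheres: bool
    csp: str                   # cavum septum pellucidum: "clear", "suspected" or "absent"
    thalami: str               # "clear", "suspected" or "absent"
    cerebellar_edge: str       # "clear" or "unclear"

def score_tcd_frame(a: TCDFrameAnnotation) -> int:
    """Score a candidate TCD frame per the clinical criteria described above."""
    two_or_one = {"clear": 2, "suspected": 1}
    score = 0
    score += 1 if a.horizontal else 0
    score += 1 if a.magnified_30pct else 0
    score += 1 if a.symmetric_hemispheres else 0
    score += two_or_one.get(a.csp, 0)
    score += two_or_one.get(a.thalami, 0)
    score += {"clear": 2, "unclear": 1}.get(a.cerebellar_edge, 0)
    return score

# Frames scoring 6 or more are classified as TCD-suitable; lower scores
# fall back to the background class.
annotation = TCDFrameAnnotation(True, True, True, "clear", "suspected", "clear")
label = "TCD" if score_tcd_frame(annotation) >= 6 else "background"
```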
In step S2, a selection is made from the classified frames received in step S1. The selection includes a first sample of frames and a second sample of frames. Each sample comprises multiple frames. The first sample of frames may be referred to as a support set. The second sample of frames may be referred to as a query set. The query set is typically significantly larger than the support set.
In step S3, a machine learning model is used to derive a prototype feature vector for each of one or more target anatomical features. Various machine learning models may be used in principle. In some embodiments, the machine learning model comprises a deep learning algorithm. The deep learning algorithm may comprise a convolutional neural network for example.
Each prototype feature vector may be derived as follows. Feature vectors are obtained by inputting to the machine learning model frames from the first sample that correspond to (i.e. belong to a classified frame class that corresponds to) a respective one of the target anatomical features. For example, in an embodiment where the target anatomical features comprise TCD and HC, a first prototype feature vector may be obtained for TCD as the target anatomical feature and a second prototype feature vector may be obtained for HC as the target anatomical feature. Each prototype feature vector may be obtained by averaging over feature vectors corresponding to the respective classified frame class. Thus, a prototype feature vector for a given target anatomical feature may be obtained by averaging over the feature vectors that correspond to that target anatomical feature. The first prototype feature vector may thus be obtained by averaging over feature vectors obtained for TCD as the target anatomical feature. The second prototype feature vector may be obtained by averaging over feature vectors obtained for HC as the target anatomical feature.
In some embodiments, step S3 further comprises deriving a prototype feature vector for a further class characteristic such as background signal only. The prototype feature vector for the further class characteristic is derived from feature vectors obtained by inputting to the machine learning model frames from the first sample that are classified as corresponding to the further class characteristic (e.g. background signal only).
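A minimal sketch of the prototype derivation of step S3 is given below, assuming a PyTorch model that maps a batch of frames to feature vectors; the function and variable names are illustrative only. It covers both the target-anatomical-feature classes and a further class such as background.

```python
# Illustrative sketch of step S3: one prototype per class, obtained by
# averaging the feature vectors of the first-sample (support) frames
# belonging to that class. Assumes each class is represented at least
# once in the support sample.
import torch

def compute_prototypes(model, support_frames, support_labels, num_classes):
    """support_frames: (N_S, C, H, W); support_labels: (N_S,) integer classes."""
    embeddings = model(support_frames)                # (N_S, D) feature vectors
    prototypes = torch.stack([
        embeddings[support_labels == k].mean(dim=0)   # mean over the class-k frames
        for k in range(num_classes)                   # e.g. K=3: TCD, HC, background
    ])
    return prototypes                                 # (K, D)
```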
More generally, prototype feature vectors can be derived for a wide range of class characteristics, including characteristics related to the type of imaging plane, the level of quality of the imaging plane, and others. A prototype feature vector in the context of the present disclosure encompasses any categorical representation in an embedding space. Typically, the prototype feature vector will be a high-dimensional vector containing semantic information of a specific class.
In step S4, the machine learning model is used to derive a feature vector for each of the frames in the second sample.
In step S5, the method comprises calculating metrics representing respective distances (e.g. Euclidean distances), in an embedded space of the feature vectors, between each of the feature vectors derived in step S4 and each of the prototype feature vectors derived in step S3.
For example, in an embodiment where the target anatomical features comprise TCD and HC, metrics can be calculated that represent distances between each feature vector and a prototype feature vector representing TCD (derived in step S3) and distances between each feature vector and a prototype feature vector representing HC (derived in step S3).
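The distance calculation of step S5 may be sketched as follows, again in illustrative PyTorch, using squared Euclidean distances between each second-sample (query) feature vector and each prototype.

```python
# Illustrative sketch of step S5: an (N_Q, K) matrix of squared Euclidean
# distances between query embeddings and the K class prototypes.
import torch

def pairwise_sq_distances(query_embeddings, prototypes):
    """query_embeddings: (N_Q, D); prototypes: (K, D)."""
    return torch.cdist(query_embeddings, prototypes, p=2) ** 2
```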
In step S6, the method comprises iteratively modifying parameters of the machine learning model and repeating steps S2-S5 to optimize a loss function that is a function of the metrics calculated in step S5. Each of two or more of the repetitions may be performed with different first and/or second samples.
Various loss functions may be used. In some embodiments, as exemplified in the detailed example below, a probability distribution over all classes is calculated with a distance-based softmax. A cross-entropy loss term may then be calculated. Because this term depends on the metrics representing distances, this term may be referred to as a metric-based cross-entropy loss term Lmetric. An example of this type of term is discussed below in the section “Prototypical Learning Module”. The loss function may thus include a first cross-entropy loss term that is a function of the metrics calculated in step S5 for feature vectors derived from the first and second samples selected in step S2. The loss function may be configured to introduce a constraint that data from the same class should be similar in the embedding space, for example by favouring a reduction of the distance in the embedded space, for each classified frame class (e.g. corresponding to a given target anatomical feature), between the prototype feature vector for that class and the feature vectors derived in step S4 from frames belonging to the same class.
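A minimal sketch of such a metric-based cross-entropy term is shown below: negated distances act as the logits of a distance-based softmax, so minimizing the loss pulls each feature vector towards the prototype of its own class. The names are illustrative.

```python
# Illustrative sketch of the metric-based cross-entropy loss term L_metric.
import torch.nn.functional as F

def metric_cross_entropy(distances, query_labels):
    """distances: (N_Q, K) from pairwise_sq_distances; query_labels: (N_Q,)."""
    logits = -distances                            # smaller distance -> larger logit
    return F.cross_entropy(logits, query_labels)   # applies log-softmax internally
```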
In some embodiments, step S1 further comprises receiving a plurality of unclassified frames of ultrasound measurement data. The unclassified frames may be referred to as unlabelled frames or unlabelled images. In such embodiments, step S2 may further comprise selecting from the plurality of unclassified frames a third sample of frames. The third sample of frames may be referred to as an unlabelled set. Step S4 may then further comprise using the machine learning model to derive a feature vector for each of the frames in the third sample.
In an embodiment, the loss function used in step S6 may include a second loss term that is a function of the metrics calculated in step S5 for feature vectors derived from the third sample. The second loss term may comprise a temperature-tuned entropy loss that causes transfer of information from the prototype feature vectors derived in step S3 to unlabelled datapoints by minimizing the entropy of a temperature tuned softmax. The second loss term may be referred to as a semantic transfer loss term LST. An example of this type of term is discussed below in the section “Unsupervised Semantic Transfer”.
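The second loss term may be sketched as follows, assuming the same kind of distance matrix is computed for the third (unlabelled) sample; the small epsilon is an assumption added for numerical stability.

```python
# Illustrative sketch of the semantic transfer loss term L_ST: the entropy
# of a temperature-tuned, distance-based softmax over unlabelled embeddings.
import torch
import torch.nn.functional as F

def semantic_transfer_loss(unlabelled_distances, temperature):
    """unlabelled_distances: (N_U, K) prototype distances for the third sample."""
    p = F.softmax(-unlabelled_distances / temperature, dim=1)  # (N_U, K)
    entropy = -(p * torch.log(p + 1e-12)).sum(dim=1)           # per-point entropy
    return entropy.mean()                                      # minimized in training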
In an embodiment, the machine learning model comprises a further fully-connected layer that maps the feature vectors of the frames in the second sample to class scores directly. This serves as an auxiliary classifier and may be trained with a cross-entropy loss to provide a further cross-entropy loss term, LCE, in the loss function used in step S6. In an embodiment, training signal annealing (TSA) is introduced to the further cross-entropy loss term, LCE, as exemplified in further detail below.
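A sketch of such a TSA-modified cross-entropy term is given below, using the exponential annealing schedule discussed in the detailed example further below; the masking and normalization details are assumptions of this sketch.

```python
# Illustrative sketch of training signal annealing (TSA) applied to the
# auxiliary classifier's cross-entropy loss: confidently-predicted query
# examples are masked out of the loss until late in training.
import math
import torch
import torch.nn.functional as F

def tsa_cross_entropy(logits, labels, step, total_steps, num_classes):
    """logits: (N_Q, K) auxiliary class scores; labels: (N_Q,) integer classes."""
    # Exponential schedule: eta_t rises from 1/K towards 1 over training.
    eta_t = math.exp((step / total_steps - 1.0) * 5.0) * (1 - 1 / num_classes) \
            + 1 / num_classes
    probs = F.softmax(logits, dim=1)
    p_correct = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # P(y_q | x_q)
    keep = (p_correct < eta_t).float()                           # indicator mask I{...}
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (keep * per_example).sum() / keep.sum().clamp(min=1.0)
```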
In step S11, the method comprises providing a machine learning model trained using any of the methods for training a machine learning model described herein (e.g. the method described above with reference to steps S1 to S6).
In step S12, input data comprising a plurality of input frames of ultrasound measurement data is received. Each input frame represents a different imaging plane of ultrasound measurement data.
In step S13, the trained machine learning model generates a quality metric for each of the input frames. The quality metric quantifies a relative capacity of the input frame to provide information about a respective one of the target anatomical features. The quality metric may be calculated by using the trained machine learning model to calculate a probability that a feature vector corresponding to a respective input frame belongs to a particular class of feature vectors represented by the model. A probability distribution for each of the classes over an embedded space may, for example, be used to determine the probability that the feature vector belongs to each of the available classes (e.g. target anatomical features). If the probability is very high for a first of the classes and low for all of the other classes, it may be concluded that the quality metric for the input frame should be high in respect of the first class (and low in respect of all of the other classes).
In some embodiments, the trained machine learning model is used to generate a plurality of quality metrics for each input frame. Each quality metric may quantify a relative capacity of the input frame to provide information about a different respective one of the target anatomical features. For example, a quality metric for HC may be calculated for the input frame and a quality metric for TCD may be calculated for the input frame. The quality metrics are calculated for each input frame by calculating a feature vector corresponding to the input frame and comparing the calculated feature vector with a probability distribution over the embedded space for each target anatomical feature (e.g. with a probability distribution for HC and a probability distribution for TCD).
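At inference time, the quality metrics may be computed as in the following sketch, in which the per-class probabilities of a distance-based softmax over the trained prototypes serve as the quality metrics; `model` and `prototypes` are assumed to come from the training procedure above.

```python
# Illustrative inference sketch (steps S12-S13): one quality metric per
# input frame per class, from a distance-based softmax over prototypes.
import torch
import torch.nn.functional as F

@torch.no_grad()
def quality_metrics(model, input_frames, prototypes):
    """input_frames: (N, C, H, W) -> (N, K) quality metrics."""
    embeddings = model(input_frames)                   # (N, D) feature vectors
    d = torch.cdist(embeddings, prototypes, p=2) ** 2  # (N, K) squared distances
    return F.softmax(-d, dim=1)                        # high value = suitable frame
```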
The generated quality metrics may be used as the basis for selecting a suitable frame for further processing. For example, an input frame may be selected on the basis of a generated quality metric (e.g. whether a quality metric of interest is high). The selected input frame may then be processed to determine information about the target anatomical feature corresponding to the generated quality metric that was used to perform the selection. For example, when a quality metric for HC is high, the corresponding input frame may be processed to obtain information about HC. Alternatively, a distribution of quality metrics may be used to decide whether or not a sequence of video contains any frames of adequate quality to obtain the required information. If none of the frames is of adequate quality, the video should be retaken.
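The selection logic may be sketched as follows; the quality threshold is an assumed, application-specific parameter rather than a value taught by the disclosure.

```python
# Illustrative sketch of frame selection: pick the best frame per target
# anatomical feature and flag whether the video should be retaken.
import torch

def select_frames(metrics, min_quality=0.9):
    """metrics: (N, K) quality metrics, e.g. columns [TCD, HC, background]."""
    best_quality, best_index = metrics.max(dim=0)  # per-class best frame index
    adequate = best_quality >= min_quality         # per-class adequacy flag
    return best_index, adequate                    # retake if no target class is adequate
```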
Embodiments of the disclosure thus allow feedback to be obtained about a relevant quality of ultrasound imaging data while a video is being taken. The feedback is based on automatically obtained quality metrics and may assist a user to decide whether to accept or retake the video and/or how to select the most useful frames in the video. The quality assessment may run through each acquired video frame by frame. An example screen shot from an example implementation is shown in the accompanying drawings.
For each training iteration, we randomly sample a support set XS (a small batch of labelled images, referred to above as the “first sample of frames” in the discussion of steps S2 and S3).
In this example, MobileNet (Howard, A. G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)), which introduces depth-wise separable convolution, was used to reduce the computational cost and the number of trainable parameters of the convolution layers (other architectures might also be used). This is achieved by applying a channel-wise convolution first, followed by a 1×1 point-wise convolution to linearly combine the feature maps across the channels. The complete CNN architecture, summarized in Table 1, consists of 30 layers (14 convolutional layers, 13 depth-wise convolutional layers, 1 global average pooling layer, 1 fully connected layer and 1 softmax classifier). Each video frame is fed through the CNN, which outputs a 7×7 feature map. A global pooling operation is applied to pool the feature maps channel-wise (1024 feature channels) to produce a 1024-dimensional feature vector. Class scores are obtained by a fully connected operation that maps the feature vector to class scores, and a softmax is then used to generate a probability distribution across the classes.
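For illustration, one depth-wise separable convolution block of the kind described above may be written as follows (a sketch of the standard MobileNet building block, not a reproduction of the exact architecture of Table 1).

```python
# Illustrative MobileNet-style block: channel-wise (depth-wise) 3x3
# convolution followed by a 1x1 point-wise convolution that linearly
# combines feature maps across channels.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)  # channel-wise
        self.pointwise = nn.Conv2d(in_channels, out_channels,
                                   kernel_size=1, bias=False)       # 1x1 combine
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```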
To avoid overfitting, a prototypical learning module 21 may be introduced, as illustrated in the accompanying drawings. For each class k, a prototype Protok is computed as the mean of the feature embeddings fθ(xs) of the support frames xs∈XS belonging to that class:

Protok = (1/|XSk|) Σxs∈XSk fθ(xs) (1)

where XSk denotes the subset of support frames labelled with class k.
Then, a query set of NQ labelled frames (larger than the support set) is sampled. For each query point (xq, yq), where xq∈XQ and yq∈{1, …, K} (K=3 in this example), the Euclidean distance to each prototype is measured and a probability distribution over all classes is produced with a distance-based softmax. The metric-based cross-entropy loss is then calculated:

Lmetric(XS, XQ) = −(1/NQ) Σ(xq,yq)∈XQ log p(yq|xq), where p(y=k|xq) = exp(−∥fθ(xq)−Protok∥²) / Σk′ exp(−∥fθ(xq)−Protok′∥²) (2)
This loss introduces the constraint that data from the same class should be similar in the embedding space. It was also found that this loss can provide guidance to stabilize the unsupervised semantic transfer.
A semantic transfer objective, LST(XU), is defined that transfers information from the prototypes produced above to unlabelled datapoints by minimizing the entropy of a temperature-tuned softmax. Entropy minimization can be used for unsupervised and semi-supervised learning by encouraging low-density separation between classes. At each training epoch, an unlabelled set XU of NU frames (NU=NQ) is sampled, and for each datapoint xu∈XU in the unlabelled set the Euclidean distance between its feature embedding and each categorical prototype is again computed. The semantic transfer loss is then defined as:

LST(XU) = −Σxu∈XU Σk P(xu, Protok) log P(xu, Protok) (3)
where P(xu, Protok) is a softmax function that generates a probability distribution over all classes based on −∥fθ(xu)−Protok∥² for each class k, and τ is the temperature of the softmax. The softmax temperature can then be tuned, as shown in the accompanying drawings.
In addition to the above, in the present example another fully-connected layer is introduced that maps the feature vectors of query frames to class scores directly. This serves as an auxiliary classifier, trained with a cross-entropy (CE) loss, making it possible to investigate the interactions between direct learning and metric learning. To mitigate overfitting, in one realisation training signal annealing (TSA) (e.g. as described in Xie, Q., Dai, Z., Hovy, E., Luong, M. T., Le, Q. V.: Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 (2019)) is introduced to the cross-entropy loss:

LCE(XQ) = −(1/NQ) Σ(xq,yq)∈XQ I{Pθ(yq|xq) < ηt} log Pθ(yq|xq) (4)
where I{·} is the indicator function and Pθ(yq|xq) is the probability of xq belonging to the class yq. Specifically, the example (xq, yq) does not contribute to the loss function if the model predicted probability surpasses a threshold ηt at training step t. We set ηt = exp((t/T − 1)·5)·(1 − 1/K) + 1/K, where T is the total number of training steps, which corresponds to the exponential schedule in Xie et al., releasing most of the supervised signal at the end of training. Intuitively, this is to prevent the model from overfitting too quickly by penalizing over-confident predictions in the early stage of training. Finally, the model jointly optimizes over the objective function as follows:
L(XS,XQ,XU)=LST(XU)+αLmetric(XS,XQ)+βLCE(XQ) (5)
where the hyperparameters α and β determine the influence of the metric learning and the direct learning, respectively.
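Drawing the pieces together, one training iteration over the joint objective of Eq. (5) may be sketched as below, reusing the illustrative functions from the earlier sketches; the episodic sampling of the support, query and unlabelled sets is assumed to happen outside this function.

```python
# Illustrative sketch of one optimization step over Eq. (5):
# L = L_ST + alpha * L_metric + beta * L_CE.
import torch

def training_step(model, aux_head, optimizer, support, query, unlabelled,
                  num_classes, temperature, alpha, beta, step, total_steps):
    support_frames, support_labels = support
    query_frames, query_labels = query
    prototypes = compute_prototypes(model, support_frames, support_labels, num_classes)
    query_emb = model(query_frames)
    unlabelled_emb = model(unlabelled)
    d_query = torch.cdist(query_emb, prototypes, p=2) ** 2   # (N_Q, K)
    d_unlab = torch.cdist(unlabelled_emb, prototypes, p=2) ** 2  # (N_U, K)
    loss = (semantic_transfer_loss(d_unlab, temperature)
            + alpha * metric_cross_entropy(d_query, query_labels)
            + beta * tsa_cross_entropy(aux_head(query_emb), query_labels,
                                       step, total_steps, num_classes))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```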
Evaluation of the models was performed using Average Precision (AP) measured on frame-level labels. For each class, a correct detection is counted if it is a positive prediction with confidence above a certain threshold (ranging from 0.1 to 0.9). AP is reported for the individual classes as well as mean Average Precision (mAP) in Table 2. It was found that direct learning with the CE loss can result in poor generalization to test data, as indicated by the baseline model (CE), which is the worst among all models. There is an improvement across all metrics after applying TSA to the CE loss, but it is marginal. Moreover, it was found that Lmetric (Eq. 2) can significantly improve model generalization. When applying a full metric learning signal (i.e. fixing α=1), all metrics increase when reducing the contribution of the cross-entropy loss (i.e. reducing β in Eq. 5).
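The threshold-based AP computation described above may be read as in the following sketch; this is one plausible interpretation of the protocol, averaging precision over the stated confidence thresholds, not a definitive reproduction of the evaluation code.

```python
# Illustrative per-class average precision over confidence thresholds
# 0.1-0.9, as one reading of the evaluation protocol described above.
import numpy as np

def average_precision(confidences, is_positive, thresholds=np.arange(0.1, 1.0, 0.1)):
    """confidences: (N,) class scores; is_positive: (N,) boolean frame labels."""
    precisions = []
    for t in thresholds:
        predicted = confidences >= t                 # positive predictions at t
        if predicted.sum() > 0:
            precisions.append((predicted & is_positive).sum() / predicted.sum())
    return float(np.mean(precisions)) if precisions else 0.0
```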
The best performance is achieved when β equals 0.1, whereas, when applying a full cross-entropy signal (i.e. fixing β=1), the models overall perform worse than models trained with a full metric learning signal. Of note, all metrics drop in value when the contribution of the metric learning loss is gradually reduced from 0.5 to 0.1 to 0. It was also found that, overall, there is a higher AP for HC than for TCD; this may be because some low-quality TCD frames (with unclear cerebellar edges) are prone to being confused with HC frames. t-SNE was also performed on the test dataset, as summarized in the accompanying drawings.
Heat maps of example frames are also shown in the accompanying drawings.
The on-device performance of the models was also evaluated, as shown in the accompanying drawings.
Priority applications:

Number | Date | Country | Kind
---|---|---|---
20390003.0 | Sep 2020 | EP | regional
2017510.5 | Nov 2020 | GB | national

International filing:

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2021/076060 | 9/22/2021 | WO |