Condition and property detection, such as various types of diagnostics, may be performed in a variety of ways that differ considerably depending on the condition or property serving as the subject of analysis, but those techniques often rely on manual processes. As one example, medical diagnostics often require extraction and testing of a blood or tissue sample, or expert review and interpretation of images or test results generated by sophisticated testing equipment such as computerized tomography (CT) or magnetic resonance imaging (MRI) scanners, electrocardiogram (ECG) machines, and the like. As another example, diagnostics performed on industrial equipment or other machines typically require human inspection, or at the very least review of sensor data by a trained human technician. Despite the diversity of the diagnostic techniques in use, a common element among many is the need for a human having some level of expertise to participate in the process. However, given the costliness of such human involvement, there exists a need in the art for automated solutions capable of inferentially interpreting diagnostic data.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
The present application discloses systems and methods for performing machine learning model based condition and property detection. It is noted that although the present condition and property detection solution is described below in detail by reference to
Specific example use cases for the present novel and inventive concepts may include using video to predict early onset Alzheimer's disease or Parkinson's disease, or to predict a leg injury in a subject, for instance, based on walking or other movement by the subject. Alternatively, or in addition, video may be used to predict that a subject has had a stroke based on upper body movements or facial movements or expressions by the subject. As yet another alternative, or additionally, AV content or audio content may be used to diagnose malfunction of an appliance, such as a washing machine, or the need to replace a timing belt or other drive component of a car. Nevertheless, it is emphasized that any particular use case described or alluded to in the present application is not to be interpreted as limiting.
In some implementations, the systems and methods disclosed by the present application may be substantially or fully automated. As used in the present application, the terms “automation,” “automated,” and “automating” refer to systems and processes that do not require the participation of a human user, such as a human system administrator. Although, in some implementations, an engineer or medical professional may review the performance of the automated systems operating according to the automated processes described herein, that human involvement is optional. Thus the processes described in the present application may be performed under the control of hardware processing components of the disclosed systems.
It is noted that the present media property prediction solution is machine learning model based. As defined in the present application, a “machine learning model,” or “ML model,” refers to a mathematical model for making future predictions based on patterns learned from samples of data obtained from a set of trusted known matches and known mismatches, known as training data. Various learning algorithms can be used to map correlations between input data and output data. These correlations form the mathematical model that can be used to make future predictions on new input data. Such a predictive model may include one or more logistic regression models, Bayesian models, or artificial neural networks (NNs), for example. In addition, machine learning models may be designed to progressively improve their performance of a specific task.
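By way of purely illustrative example, and not as any part of the disclosed implementations, the following Python sketch shows the kind of predictive ML model defined above: a logistic regression classifier fit on labeled training samples and then used to make predictions on new input data. All data, names, and parameter values in the sketch are assumptions chosen for illustration only.

```python
# Minimal illustrative sketch of a predictive ML model as defined above:
# a logistic regression classifier fit on labeled training samples, then
# used to make predictions on new input data. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)

# Synthetic training data: 200 samples of 8 features each, with binary
# labels standing in for "known matches" (1) and "known mismatches" (0).
X_train = rng.normal(size=(200, 8))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X_train, y_train)  # learn correlations between inputs and labels

# Future prediction on new, unseen input data.
X_new = rng.normal(size=(5, 8))
print(model.predict_proba(X_new)[:, 1])  # predicted probability of a "match"
```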
A NN is a type of machine learning model in which patterns or learned representations of observed data are processed using highly connected computational layers that map the relationship between inputs and outputs. A “deep neural network” (deep NN), in the context of deep learning, may refer to a NN that utilizes multiple hidden layers between input and output layers, which may allow for learning based on features not explicitly defined in raw data. As used in the present application, a feature labeled or described as a NN refers to a deep neural network. In various implementations, NNs may be utilized to perform image processing, audio processing, or natural-language processing, for example.
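Again purely as an illustrative sketch, a deep NN in the sense defined above, i.e., a network with multiple hidden layers between its input and output layers, might be instantiated as follows; the scikit-learn classifier, layer sizes, and synthetic data are assumptions for illustration, not a disclosed architecture.

```python
# Illustrative sketch of a deep NN in the sense defined above: a network
# with multiple hidden layers between its input and output layers.
# Uses scikit-learn's MLPClassifier on synthetic data purely as an example.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(300, 16))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1.0).astype(int)  # nonlinear target

# Two hidden layers of 32 units each sit between the input and output
# layers, allowing features not explicitly defined in the raw data to be
# learned from the training samples.
deep_nn = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=1)
deep_nn.fit(X, y)
print(deep_nn.predict(X[:5]))
```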
As further shown in
Although the present application refers to software code 108 as being stored in system memory 106 for conceptual clarity, more generally, system memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium”, as used in the present application, refers to any medium excluding a carrier wave or other transitory signal that provides instructions to processing hardware 104 of computing platform 102 or to respective processing hardware of user systems 140a-140d. Thus, a computer-readable non-transitory storage medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory storage media include, for example, optical discs such as DVDs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
Processing hardware 104 of system 100 may include multiple hardware processing units, such as one or more central processing units, one or more graphics processing units, one or more tensor processing units, one or more field-programmable gate arrays (FPGAs), custom hardware for machine-learning training or inferencing, and an application programming interface (API) server, for example. By way of definition, as used in the present application, the terms “central processing unit” (CPU), “graphics processing unit” (GPU), and “tensor processing unit” (TPU) have their customary meaning in the art. That is to say, a CPU includes an Arithmetic Logic Unit (ALU) for carrying out the arithmetic and logical operations of computing platform 102, as well as a Control Unit (CU) for retrieving programs, such as software code 108, from system memory 106, while a GPU may be implemented to reduce the processing overhead of the CPU by performing computationally intensive graphics or other processing tasks. A TPU is an application-specific integrated circuit (ASIC) configured specifically for artificial intelligence (AI) processes such as machine learning.
Although
It is further noted that, although user systems 140a-140d are shown variously as smartphone computer 140a, video camera 140b, microphone 140c, and machine or diagnostic device 140d, in
User system 240 corresponds in general to any or all of user systems 140a-140d in
User system processing hardware 244 may include multiple hardware processing units, such as one or more CPUs, one or more GPUs, one or more TPUs, and one or more FPGAs, for example, as those features are defined above.
With respect to user software application 250, it is noted that in some implementations, user software application 250 may be a thin client application of software code 108, in
According to the exemplary implementation shown in
System 100 and user system 240 are further described below by reference to
By way of background, existing voice-based methods for detecting COVID-19 require analysis of pre-identified utterances of interest, such as coughs, manual segmentation of existing ground truth audio data sets by human researchers in order to isolate such utterances, and test subjects who are required to perform these specific forced utterances, e.g., forced coughing. These conventional approaches hamper the creation of an effective voice-based COVID-19 detector for several reasons. For example, by limiting analysis to pre-identified utterances of interest the possible solutions obtainable are restricted to only those that can arise from preconceived hypotheses, thereby hindering serendipity. In addition, manual segmentation of data prevents end-to-end processes from being automated, which impedes the rapid iterations and convergence typically made possible by machine-learning approaches. Moreover, requiring collection of uncommon utterances limits data collection opportunities to laboratory scenarios or coached data collection initiatives, while requiring collection of symptom-based utterances restricts opportunities to collect data from asymptomatic disease carriers.
In the exemplary use case of a novel infectious disease, such as COVID-19, for which an extensive knowledge base is under development, the prediction solution disclosed in the present application overcomes the aforementioned deficiencies in the conventional art by implementing a multi-step ML model based process that can predict disease presence by automatically segmenting unstructured vocal sample data into normalized datasets whose data elements are of specific audio segment types determined to be optimal for prediction of COVID-19 infection, and using those datasets to train disease predictors to be more flexible and precise than conventional approaches allow.
For example, if it is known that analysis of the audio properties of a certain type of vocal utterance, such as “mmmmm,” for instance, is most effective for prediction of COVID-19, the present prediction solution can take a dataset of unstructured voice samples that are tagged with COVID-19 status (collected from hospitals, for example), extract the “mmmmm” utterance segments into a normalized dataset that retains the corresponding COVID-19 status tags, and use this dataset to train and deploy a composite ML model that can predict COVID-19 status based upon input of an unstructured vocal sample. It is noted that the vocal utterance identified in the present application as “mmmmm” refers to a sustained consonantal sound known formally as the “voiced bilabial nasal,” identified by the symbol (m) in the International Phonetic Alphabet. That is to say, the vocal utterance “mmmmm” is produced by sustaining the sound of the English letter “m” at the end of the English word “them.”
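The dataset-normalization step described above might be sketched as follows, purely for illustration: unstructured voice samples tagged with COVID-19 status are reduced to “mmmmm” utterance segments that retain their status tags. The `find_mmm_segments` function is a hypothetical stand-in for a trained segmenter and is stubbed out here; all data is synthetic.

```python
# Hedged sketch of the dataset-normalization step described above: given
# unstructured voice samples tagged with COVID-19 status, extract only the
# "mmmmm" utterance segments into a normalized dataset that keeps each
# sample's status tag. find_mmm_segments is a hypothetical placeholder for
# a trained segmenter; here it is stubbed out.
import numpy as np

def find_mmm_segments(waveform, sample_rate):
    """Hypothetical segmenter: returns (start, end) sample indices of
    'mmmmm' utterances. Stubbed to return one fixed segment."""
    return [(0, sample_rate)]  # pretend the first second is an "mmmmm"

def build_normalized_dataset(tagged_samples, sample_rate=16_000):
    """tagged_samples: list of (waveform, covid_status) pairs."""
    segments, tags = [], []
    for waveform, status in tagged_samples:
        for start, end in find_mmm_segments(waveform, sample_rate):
            segments.append(waveform[start:end])  # keep only the utterance
            tags.append(status)                   # retain the status tag
    return segments, tags

# Synthetic stand-in for hospital-collected, status-tagged voice samples.
samples = [(np.random.randn(3 * 16_000), 1), (np.random.randn(2 * 16_000), 0)]
segs, tags = build_normalized_dataset(samples)
print(len(segs), tags)  # -> 2 [1, 0]
```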
Although this example relies on prior knowledge that a segment type based on the “mmmmm” utterance would be useful for ML model based prediction of COVID-19, the present prediction process can be performed with various different segment types in parallel, such as isolating utterances of each vocal phoneme into separate datasets, for example, in order to determine which segment types are most effective for prediction of COVID-19. This may be advantageous in cases where no a priori or existing a posteriori knowledge exists, as well as for identifying segments which can expand data collection opportunities. For instance, a segment based on a phoneme sound such as “oʊ” would be collectible from normal speech (and thus amenable to ambient, passive data collection approaches), but a segment based on a coughing sound would only be collectible from people who are symptomatic or from those who are instructed to cough via a coached data collection process.
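A hedged sketch of such a parallel evaluation follows: each candidate segment type yields its own dataset, a predictor is trained per type, and cross-validated accuracy indicates which segment types are most effective. The segment-type names, feature matrices, and score differences are synthetic assumptions for illustration only.

```python
# Illustrative sketch of running the pipeline for several candidate segment
# types in parallel to see which is most predictive. Everything here is
# synthetic; in practice each segment type would yield its own normalized
# dataset and its own trained predictor (a separate ML3 instance).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=2)
n, d = 400, 12
labels = rng.integers(0, 2, size=n)

# Hypothetical per-segment-type feature matrices; one is made informative.
datasets = {
    "mmmmm": rng.normal(size=(n, d)) + labels[:, None] * 0.8,  # informative
    "oʊ":    rng.normal(size=(n, d)) + labels[:, None] * 0.2,  # weaker
    "cough": rng.normal(size=(n, d)),                          # uninformative
}

for segment_type, X in datasets.items():
    scores = cross_val_score(LogisticRegression(), X, labels, cv=5)
    print(f"{segment_type}: mean accuracy {scores.mean():.2f}")
```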
To illustrate the process outlined above in detail, consider the exemplary use case in which the objective is to create a voice-based predictor for COVID-19 based on input of normal speech, and that there is reason to believe that analysis of the vocal resonances in the sound “mmmmm” will be particularly useful for prediction of COVID-19. Under those circumstances the present prediction solution may proceed as follows:
1: Referring to
2: Referring to
4: Referring to
5: Referring to
The performance of ML+ can be improved over time by direct training via datasets in the form of DS3, or through independent improvements in ML1, ML2, or ML3. In addition, any ground truth data from additional COVID-19 test results, such as polymerase chain reaction (PCR) tests for example, can be used to augment dataset DS3, and consequently refine ML3 and parameters, such as acceptable prediction likelihood thresholds, averaging processes used for multiple ML3 predictions in ML+, and the like. ML+ may be a feature of software code 108 of system 100, or of user software application 250 of user system 240. Moreover, procedure 5 described above may be performed by software code 108, executed by processing hardware 104 of system 100, or by user software application 250, executed by user system processing hardware 244.
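The kind of aggregation parameters mentioned above, such as acceptable prediction likelihood thresholds and averaging processes for multiple ML3 predictions, might look as follows in a minimal sketch; the threshold value and the simple mean are illustrative assumptions rather than disclosed parameters.

```python
# Hedged sketch of aggregating per-segment ML3 predictions within ML+:
# average the probabilities and compare the mean to an acceptable
# prediction-likelihood threshold. Threshold and averaging rule are
# illustrative assumptions only.
def aggregate_predictions(segment_probs, threshold=0.6):
    """segment_probs: per-segment probabilities from ML3 for one subject."""
    if not segment_probs:
        return None  # no usable segments detected by the first stage
    mean_prob = sum(segment_probs) / len(segment_probs)
    return {"probability": mean_prob, "positive": mean_prob >= threshold}

print(aggregate_predictions([0.72, 0.65, 0.58]))  # positive at 0.65 mean
```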
As noted above, in some implementations, procedures 1 through 5 may be performed on system 100, while in other implementations procedures 1 through 5 may be performed on user system 240. However, in other implementations, ML+ may be deployed to user software application 250 on user system 240 after its creation on system 100. In still other implementations, ML+ may be deployed to system 100 after its creation elsewhere, and software code 108, when executed by processing hardware 104, may utilize that pre-existing ML+ to predict the respiratory infection (e.g., COVID-19) status using unstructured voice samples.
In the above-described case prior knowledge that the vocal resonances in the sound “mmmmm” would be particularly useful for prediction of COVID-19 was presumed. In the absence of such knowledge, or as a supplement to it, it is noted that procedures 1 through 4 can be used to generate multiple different instances of ML3 in parallel. In such an implementation, procedures 1 through 4 could be performed for each vocal phoneme instead of just the sound “mmmmm,” for example, and the most efficacious vocal segment types for prediction of COVID-19 could be determined. This could be accomplished by creating DS1 datasets tagged with Yes/No for each of the different segment types, or all at once in a single non-binary segmenter via the procedures depicted in
An example of generating multiple different instances of ML3 is shown in
As an example, in some implementations, ML+ can be enhanced as the result of the generation of multiple instances of ML3 using different instances of ML2 as shown in
Because the process depicted in
Moreover, for the specific use case of diagnosing COVID-19 and other infectious diseases, because the present ML model based diagnostic solution is configured to detect human manifestations of the disease state, it is agnostic as to the infectious agent itself, and therefore remains effective as a diagnostic tool even when infectious vectors mutate. Thus, in contrast to rapid antigen tests for COVID-19, which are to some extent variant specific, and tend to fail when the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causing COVID-19 mutates, the present ML model based diagnostic solution can be expected to be, and remain, robustly reliable against viral sub-variants.
As an additional advantage with respect to acquisition and management of personally identifiable information (PII) or other sensitive personal information, in implementations in which ML+ is deployed to user software application 250, any PII acquired by user software application 250 may be sequestered on user system 240 and be unavailable to system 100 or other external agents.
The functionality of system 100, user system(s) 140a-140d/240, software code 108, and user software application 250 shown variously in
Referring to
Alternatively, and as noted above, in some implementations, the diagnostic processing of the one of datasets 120a-120d may be performed locally on one of respective user systems 140a/240, 140b/240, 140c/240, or 140d/240. In these implementations, the dataset received in action 1062 may be received by user software application 250, executed by user system processing hardware 244.
Flowchart 1060 further includes performing an analysis of the dataset received in action 1062, using a first stage, i.e., ML2, of trained ML model ML+, to detect the presence of a predetermined data attribute (action 1064). For example, in the case of the COVID-19 diagnostic procedure described above, processing hardware 104 may execute software code 108, or user system processing hardware 244 may execute user software application 250, to determine whether the dataset received in action 1062 includes the sound “mmmmm” and the bounding timestamps of regions that include that characteristic of interest.
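A minimal sliding-window sketch of such a first-stage analysis, assuming a window-level detector standing in for ML2, might report bounding timestamps as follows; the energy-based scorer is a hypothetical stub, not the trained model.

```python
# Illustrative sliding-window sketch of the first-stage analysis in action
# 1064: scan the received audio, score each window with a detector standing
# in for ML2, and report bounding timestamps of regions containing the
# attribute of interest. The scorer below is a hypothetical stub.
import numpy as np

def detect_attribute_regions(waveform, sample_rate, score_fn,
                             win_s=0.5, hop_s=0.25, threshold=0.5):
    """Return (start_sec, end_sec) bounds of windows where score_fn fires."""
    win, hop = int(win_s * sample_rate), int(hop_s * sample_rate)
    regions = []
    for start in range(0, len(waveform) - win + 1, hop):
        if score_fn(waveform[start:start + win]) >= threshold:
            regions.append((start / sample_rate, (start + win) / sample_rate))
    return regions

# Stub scorer firing on high-energy windows (real system: trained ML2).
energy_score = lambda w: float(np.mean(w ** 2))

sr = 16_000
audio = np.concatenate([np.zeros(sr), np.random.randn(sr), np.zeros(sr)])
print(detect_attribute_regions(audio, sr, energy_score))
```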
Thus, in some implementations, the predetermined data attribute having its presence analyzed in action 1064 may be an audio attribute of the dataset received in action 1062. In implementations in which the predetermined data attribute is an audio attribute, that audio attribute may be derived from one or more of speech, a non-verbal utterance, or a pulmonary expulsion, such as a cough for example. Alternatively, or in addition, the predetermined data attribute having its presence analyzed in action 1064 may be a visual attribute in the form of a human tremor or tic.
Thus, in some implementations, processing hardware 104 may execute software code 108, or user system processing hardware 244 may execute user software application 250 to utilize a visual analyzer included as a feature of software code 108 or user software application 250, an audio analyzer included as a feature of software code 108 or user software application 250, or such a visual analyzer and audio analyzer, to perform the analysis of the received dataset in action 1064.
In various implementations, a visual analyzer included as a feature of software code 108 or user software application 250 may be configured to apply computer vision or other AI techniques to the dataset received in action 1062, or may be implemented as a NN or other type of ML model. Such a visual analyzer may be configured or trained to recognize physical movements and their frequency, for example.
An audio analyzer included as a feature of software code 108 or user software application 250 may also be implemented as a NN or other ML model. As noted above, in some implementations, a visual analyzer and an audio analyzer may be used in combination to analyze the received dataset. It is noted that the received dataset will typically include multiple video frames, multiple audio frames, or multiple video frames and multiple audio frames. In some of those use cases, processing hardware 104 may execute software code 108, or user system processing hardware 244 may execute user software application 250, to perform the visual analysis of the received dataset, the audio analysis of the received dataset, or both the visual analysis and the audio analysis, on a frame-by-frame basis. That is to say, in various implementations, the analysis of the received dataset in action 1064 may be performed by software code 108, executed by processing hardware 104 of system 100, or by user software application 250, executed by user system processing hardware 244.
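A frame-by-frame analysis of the kind described above might be sketched as follows, with a trivial per-frame function standing in for a trained visual or audio NN; the frame data and analyzer are illustrative assumptions.

```python
# Minimal sketch of frame-by-frame analysis: run an analyzer on each frame
# of the received dataset and collect per-frame results for downstream
# aggregation. The analyzer is a hypothetical stub; a real system would use
# a trained visual or audio NN here.
import numpy as np

def analyze_frames(frames, frame_analyzer):
    """frames: iterable of per-frame arrays; returns one score per frame."""
    return [frame_analyzer(frame) for frame in frames]

# Synthetic "video": 30 frames of 64x64 grayscale pixels.
frames = [np.random.rand(64, 64) for _ in range(30)]
brightness = lambda f: float(f.mean())  # stub standing in for a trained NN
per_frame_scores = analyze_frames(frames, brightness)
print(len(per_frame_scores), round(max(per_frame_scores), 3))
```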
In some implementations, performing the analysis of the dataset in action 1064 may include detecting, using first stage ML2 of trained ML model ML+, one or more temporal segments of the received dataset that include the predetermined data attribute. In some implementations in which first stage ML2 of trained ML model ML+ is configured to detect one or more temporal segments of the received dataset that include the predetermined data attribute, first stage ML2 may be trained using a dataset, DS2, that has been annotated to identify the predetermined data attribute, to detect temporal segments of a test dataset that include the predetermined data attribute. As noted above, DS2 may be created or obtained by system 100 or user system(s) 140a-140d/240. In implementations in which DS2 is created by system 100 or user system(s) 140a-140d/240, DS2 may be generated by training another ML model to detect the presence of the predetermined data attribute in other test data, and using an output of that ML model to train yet another ML model to predict bounding timestamps for a temporal segment of that other test data that includes the predetermined data attribute.
In some implementations, the training of first stage ML2 using DS2, the generation of DS2, or the generation of DS2 and the training of first stage ML2 using DS2, may be performed by software code 108, executed by processing hardware 104 of system 100, or by user software application 250, executed by user system processing hardware 244.
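One hedged reading of the DS2-generation procedure described above is sketched below: an auxiliary presence model flags clips containing the attribute, and a second model is trained to predict bounding timestamps within the flagged clips. Both models, their features, and the timestamp targets are synthetic placeholders, not the disclosed training data.

```python
# Hedged sketch of the DS2-generation idea: one auxiliary model flags
# whether a clip contains the attribute, and its outputs are used to train
# a second model to predict bounding timestamps within positive clips.
# Both models and all data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(seed=5)

# Clip-level features with presence labels (attribute present or not).
X_clips = rng.normal(size=(200, 10))
present = (X_clips[:, 0] > 0).astype(int)
presence_model = LogisticRegression().fit(X_clips, present)

# For clips flagged positive, train a regressor to predict (start, end)
# bounding timestamps; the targets here are synthetic, in seconds.
pos = X_clips[presence_model.predict(X_clips) == 1]
bounds = np.column_stack([rng.uniform(0, 1, len(pos)),
                          rng.uniform(1, 2, len(pos))])
timestamp_model = LinearRegression().fit(pos, bounds)
print(timestamp_model.predict(pos[:1]))  # predicted [start, end] for a clip
```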
Flowchart 1060 further includes predicting, using second stage ML3 of trained ML model ML+ when the analysis of the dataset performed in action 1064 detects the presence of the predetermined data attribute, a probability that the predetermined data attribute is indicative of a condition or a property (action 1066). In some implementations in which second stage ML3 of trained ML model ML+ is used to predict the probability that the predetermined data attribute is indicative of a condition, that condition may be one of a physical condition, a disease state, or a chronic medical condition, for example, as noted above. Alternatively, and as further noted above, in other implementations in which second stage ML3 of trained ML model ML+ is used to predict the probability that the predetermined data attribute is indicative of a condition, that condition may be the operating performance of a machine, such as its output, energy consumption, heat generation, or overall efficiency, for example.
In implementations in which first stage ML2 of trained ML model ML+ is configured to detect one or more temporal segments of the received dataset that include the predetermined data attribute, predicting the probability that the predetermined data attribute is indicative of the condition or the property using ML3 in action 1066 may include predicting whether at least one of those one or more temporal segments including the predetermined data attribute is indicative of the condition or the property. In some of those implementations, second stage ML3 may be trained using a dataset, DS4, which has been annotated to correlate the predetermined data attribute with one of the condition or the property, to predict whether a temporal segment including the predetermined data attribute is indicative of the condition or the property.
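Purely as an illustration of training a second-stage predictor on a dataset like DS4, the following sketch fits a classifier to segment-level feature vectors annotated with condition labels and outputs a probability for a new segment; the features and labels are synthetic assumptions.

```python
# Hedged sketch of training a second-stage predictor on a DS4-like dataset:
# feature vectors extracted from attribute-bearing temporal segments, each
# annotated with a condition label, fit a classifier that outputs a
# probability for new segments. Features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=3)

# DS4 stand-in: 300 segment feature vectors plus condition annotations.
X_ds4 = rng.normal(size=(300, 20))
y_ds4 = (X_ds4[:, :3].sum(axis=1) > 0).astype(int)

ml3 = LogisticRegression().fit(X_ds4, y_ds4)  # second-stage predictor

# Probability that a newly detected segment indicates the condition.
new_segment_features = rng.normal(size=(1, 20))
print(ml3.predict_proba(new_segment_features)[0, 1])
```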
Trained ML model ML+ may then be validated using validation data having a known ground truth, by delivering the validation data as an input to first stage ML2 and obtaining a prediction for the condition or the property as an output from second stage ML3. In some implementations, training of ML3, as well as validation of trained ML model ML+, may be performed by software code 108, executed by processing hardware 104 of system 100, or by user software application 250, executed by user system processing hardware 244. It is noted that in various implementations, one or both of first stage ML2 and second stage ML3 of trained ML model ML+ may be trained using a federated learning process, as known in the art. It is further noted that with respect to the method outlined by flowchart 1060, in some implementations actions 1062, 1064, and 1066 may be performed in an automated process from which human participation may be omitted.
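An end-to-end validation of the kind described above might be sketched as follows: held-out samples with known ground truth are passed through a first-stage detector and a second-stage predictor, and the resulting probabilities are scored, here with ROC AUC as an illustrative metric; both stage functions are hypothetical stubs for ML2 and ML3.

```python
# Illustrative validation sketch: run held-out samples with known ground
# truth through the two-stage model (first-stage detection, second-stage
# prediction) and score the end-to-end output with ROC AUC. The stage
# functions are hypothetical stubs for ML2 and ML3.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=4)

def ml2_detect(sample):    # stub first stage: returns candidate segments
    return [sample]        # pretend the whole sample is one segment

def ml3_predict(segment):  # stub second stage: returns a probability
    return 1 / (1 + np.exp(-segment.mean()))

# Validation set with known ground truth labels.
val_samples = [rng.normal(loc=lbl, size=50) for lbl in (0, 1)
               for _ in range(20)]
val_labels = [0] * 20 + [1] * 20

preds = [max(ml3_predict(s) for s in ml2_detect(x)) for x in val_samples]
print("validation AUC:", round(roc_auc_score(val_labels, preds), 3))
```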
Thus, the present application discloses systems and methods for performing ML model based condition and property detection. In the exemplary use case of infectious disease prediction, the present ML model based diagnostic solution can render real-time disease state predictions for asymptomatic as well as symptomatic disease carriers in a manner that does not require special equipment or specially trained personnel, and can be deployed rapidly, ubiquitously, and in a privacy-preserving way.
Moreover, the present application discloses a ML model based condition and property detection solution that can be deployed on any computer or smartphone either within its own application or embedded within another application. Consequently, the present ML model based condition and property detection solution can advantageously be deployed in an active manner, such as part of a multi-step screening process at a public or private event, or in any venue, such as an airport or cruise ship, for example, designed to host large groups. Alternatively, the present ML model based condition and property detection solution may be deployed in an ambient manner (working in the background of a mobile phone software application for example) and thereby create a system that can not only provide notice to the individual user, but may also, when the user opts in or otherwise gives informed consent, contribute to national or global real-time status/outbreak warning systems. It is emphasized that even this use case can be implemented in a privacy preserving way, because, as noted above, this ML model based condition and property detection solution can be deployed locally on each device, not requiring the sending of audio data or PII to an external server in order to render a disease state or other prediction. Additionally, because the present ML model based condition and property detection solution can employ a multi-step automated segmentation process, as described above, which allows for unstructured input data to be usable for both training and prediction purposes, it advantageously produces normalized datasets that are ideally suited for machine learning.
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.
The present application claims the benefit of and priority to a pending Provisional Patent Application Ser. No. 63/194,018 filed on May 27, 2021, and titled “Condition and Media Property Prediction via Machine Learning Model Based Temporal Segmentation of Media,” which is hereby incorporated fully by reference into the present application.