This disclosure relates generally to improving the operation of computing platforms by providing contextual execution of control inputs associated with user intents. More specifically, this disclosure relates to deep-learning based crowd-assisted systems for contextual labeling, including, without limitation, systems for deep labeling.
Improved sensor technologies and machine-based voice and image recognition technologies provide important building blocks for future processor-based apparatus which are able to operate in response to user commands which, through their reliance on intents and inferences, mimic the context-dependent ways in which humans communicate. However, the technical challenges associated with developing such contextually intelligent apparatus include, without limitation, developing corpora of contextual labels associated with locations or other sources of context for user commands. For example, manual labeling of geographic data may be inadequate, given the sheer number of locations to be labeled and the limited accuracy of humans performing labeling. Further, the technical challenges associated with developing contextually intelligent apparatus include the fact that many machine learning techniques, such as deep neural networks (DNNs), may require training on very large datasets. However, for many applications, a suitable dataset of sufficient size for training a DNN may not be available. Embodiments as disclosed and described herein are addressed to meeting these and other technical challenges associated with developing apparatus with contextual intelligence.
This disclosure provides systems and methods for deep labeling.
In a first embodiment, an apparatus includes a processor and a memory containing instructions, which when executed by the processor, cause the apparatus to receive one or more pretrained deep learning models, each pretrained deep learning model associated with a source domain, receive image data to be labeled, and input the received image data to each of the one or more pretrained deep learning models. Further, the instructions, when executed by the processor, cause the apparatus to perform an adaptation on one or more of the pretrained deep learning models, provide, from each of the pretrained deep learning models, an output in a target domain, provide an ensemble output, the ensemble output comprising labels for the image data determined based on the outputs from each of the pretrained deep learning models, and when the target domain is not completely covered by the source domain associated with a pretrained deep learning model of the one or more pretrained deep learning models, perform transfer learning on the pretrained deep learning model.
In a second embodiment, a method for contextual labeling of image data includes receiving one or more pretrained deep learning models, each pretrained deep learning model associated with a source domain, receiving image data to be labeled, and inputting the received image data to each of the one or more pretrained deep learning models. Additionally, the method includes performing an adaptation on one or more of the pretrained deep learning models, providing, from each of the pretrained deep learning models, an output in a target domain for the pretrained deep learning model, providing an ensemble output, the ensemble output comprising labels for the image data determined based on the outputs from each of the pretrained deep learning models, and when the target domain is not completely covered by the source domain associated with a pretrained deep learning model of the one or more pretrained deep learning models, performing transfer learning on the pretrained deep learning model.
In a third embodiment, a non-transitory computer-readable medium includes program code, which when executed by a processor, causes an apparatus to receive one or more pretrained deep learning models, each pretrained deep learning model associated with a source domain, receive image data to be labeled, and input the received image data to each of the one or more pretrained deep learning models. The program code, when executed by the processor, further causes the apparatus to perform an adaptation on one or more of the pretrained deep learning models, provide, from each of the pretrained deep learning models, an output in a target domain for the pretrained deep learning model, provide an ensemble output, the ensemble output comprising labels for the image data determined based on the outputs from the pretrained deep learning models, and when the target domain is not completely covered by the source domain associated with a pretrained deep learning model of the one or more pretrained deep learning models, perform transfer learning on the pretrained deep learning model.
In a fourth embodiment, an apparatus for contextual execution comprises a processor, and a memory containing instructions, which when executed by the processor, cause the apparatus to receive, from a user terminal, a control input associated with an intent, obtain location data associated with a location of the user terminal, and determine a scored set of execution options associated with the control input. Further, the instructions, when executed by the processor, cause the apparatus to obtain a contextual label associated with the location data, the label determined based on the application of one or more adapted pretrained deep learning models to the location data, rescore the set of execution options associated with the control input based on the contextual label, and provide the highest scored execution option to a processor of the user terminal.
In a fifth embodiment, a method for contextual execution includes receiving, from a user terminal, a control input associated with an intent, obtaining location data associated with a location of the user terminal, and determining a scored set of execution options associated with the control input. The method further includes obtaining a contextual label associated with the location data, the label determined based on the application of one or more adapted pretrained deep learning models to the location data, rescoring the set of execution options associated with the control input based on the contextual label and providing the highest scored execution option to a processor of the user terminal.
In a sixth embodiment, a non-transitory computer-readable medium includes program code, which when executed by a processor, causes an apparatus to receive, from a user terminal, a control input associated with an intent, obtain location data associated with a location of the user terminal, and determine a scored set of execution options associated with the control input. The program code, when executed by the processor, further causes the apparatus to obtain a contextual label associated with the location data, the label determined based on the application of one or more adapted pretrained deep learning models to the location data, rescore the set of execution options associated with the control input based on the contextual label, and provide the highest scored execution option to a processor of the user terminal.
Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The term “couple” and its derivatives refer to any direct or indirect communication between two or more elements, whether or not those elements are in physical contact with one another. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like. The term “controller” means any device, system or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.
Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory. A “non-transitory” computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
Definitions for other certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.
For a more complete understanding of this disclosure and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:
As shown in FIG. 1, applications 162 can include games, social media applications, applications for geotagging photographs and other items of digital content, virtual reality (VR) applications, augmented reality (AR) applications, operating systems, device security (e.g., anti-theft and device tracking) applications or any other applications which access resources of apparatus 100, the resources of apparatus 100 including, without limitation, speaker 130, microphone 120, input/output devices 150, and additional resources 180. Further, applications 162 may include an intelligent assistant application 163, an image recognition application 165 and a voice recognition application 167. According to various embodiments, intelligent assistant application 163 operates as an orchestrator and execution manager for other applications 162 operating on apparatus 100. For example, intelligent assistant 163 may receive outputs (such as data or method calls) from voice recognition application 167 and forward them as inputs to another application, such as an internet browser or data service application (for example, a weather application). Thus, intelligent assistant 163 can, for example, orchestrate the apparatus' response to certain voice-activated commands or requests, such as a user saying, “What's the weather?” or “Turn on living room lights.”
The communication unit 110 may receive an incoming RF signal, for example, a short-range wireless signal such as a Bluetooth® or Wi-Fi® signal. The communication unit 110 can down-convert the incoming RF signal to generate an intermediate frequency (IF) or baseband signal. The IF or baseband signal is sent to the RX processing circuitry 125, which generates a processed baseband signal by filtering, decoding, or digitizing the baseband or IF signal. The RX processing circuitry 125 transmits the processed baseband signal to the speaker 130 (such as for voice data) or to the main processor 140 for further processing (such as for web browsing data, online gameplay data, notification data, or other message data).
The TX processing circuitry 115 receives analog or digital voice data from the microphone 120 or other outgoing baseband data (such as web data, e-mail, or interactive video game data) from the main processor 140. The TX processing circuitry 115 encodes, multiplexes, or digitizes the outgoing baseband data to generate a processed baseband or IF signal. The communication unit 110 receives the outgoing processed baseband or IF signal from the TX processing circuitry 115 and up-converts the baseband or IF signal to an RF signal for transmission.
The main processor 140 can include one or more processors or other processing devices and execute the OS program 161 stored in the memory 160 in order to control the overall operation of the apparatus 100. For example, the main processor 140 could control the reception of forward channel signals and the transmission of reverse channel signals by the communication unit 110, the RX processing circuitry 125, and the TX processing circuitry 115 in accordance with well-known principles. In some embodiments, the main processor 140 includes at least one microprocessor or microcontroller.
Additionally, in some embodiments operating system 161 is capable of providing “secure world” and “normal world” execution environments for applications 162.
The main processor 140 is also capable of executing other processes and programs resident in the memory 160. The main processor 140 can move data into or out of the memory 160 as required by an executing process. In some embodiments, the main processor 140 is configured to execute the applications 162 based on the OS program 161 or in response to inputs from a user or applications 162. Applications 162 can include applications specifically developed for the platform of apparatus 100, or legacy applications developed for earlier platforms. The main processor 140 is also coupled to the I/O interface 145, which provides the apparatus 100 with the ability to connect to other devices such as laptop computers and handheld computers. The I/O interface 145 is the communication path between these accessories and the main processor 140.
The main processor 140 is also coupled to the input/output device(s) 150. The operator of the apparatus 100 can use the input/output device(s) 150 to enter data into the apparatus 100. Input/output device(s) 150 can include keyboards, touch screens, mice, trackballs or other devices capable of acting as a user interface to allow a user to interact with apparatus 100. In some embodiments, input/output device(s) 150 can include a touch panel, a virtual reality headset, a (digital) pen sensor, a key, or an ultrasonic input device.
Input/output device(s) 150 can include one or more screens, which can be a liquid crystal display, light-emitting diode (LED) display, an organic LED (OLED), an active matrix OLED (AMOLED), or other screens capable of rendering graphics.
The memory 160 is coupled to the main processor 140. According to certain embodiments, part of the memory 160 includes a random access memory (RAM), and another part of the memory 160 includes a Flash memory or other read-only memory (ROM). In the non-limiting example of FIG. 1, the memory 160 contains the operating system (OS) program 161 and applications 162.
According to certain embodiments, apparatus 100 includes a variety of additional resources 180 which can, if permitted, be accessed by applications 162. According to certain embodiments, resources 180 include an accelerometer or inertial motion unit 182, which can detect movements of the electronic device along one or more degrees of freedom. Additional resources 180 include, in some embodiments, a user's phone book 184, one or more cameras 186 of apparatus 100, and a global positioning system 188.
Although FIG. 1 illustrates one example of an apparatus 100, various changes may be made to FIG. 1. For example, certain components may be combined, further subdivided, or omitted, and additional components may be added according to particular needs.
According to various embodiments, network context 200 includes a context server 205, an artificial intelligence service server 220, a client device 230, and crowd-worker devices 225a, 225b and 225c.
Additionally, in certain embodiments, context server 205 can crowd-source the implementation of models (for example, models 210 in FIG. 2).
According to various embodiments, context server 205 runs models 210 as an analytical adjunct to an artificial intelligence (AI) service provided by a third party (for example, a service provided by AI service server 220).
In some embodiments, annotation engine 215 acts as an ingestion pipeline for aggregating crowd-sourced predictions of labels to be assigned to images from a particular location.
In certain embodiments, crowd workers 225a-225c are apparatus (for example, apparatus 100 in FIG. 1) which obtain image data and other contextual data at various locations.
According to various embodiments, network context 200 includes a client device 230.
In certain embodiments, client device 230 may also be a crowd worker.
Although FIG. 2 illustrates one example of a network context 200, various changes may be made to FIG. 2.
In the non-limiting example of FIG. 3, a labeling pipeline 300 comprises a first layer 305, a second layer 310 and a third layer 315.
According to various embodiments, first layer 305 comprises one or more pretrained deep learning models 320. Further, in the non-limiting example of FIG. 3, each pretrained deep learning model 320 comprises a series of layers 325a through 325d.
As will be discussed in greater detail herein, last layer 325d can be a “loss layer” or a layer implementing a SoftMax classifier. Layers 325a through 325d can also include, without limitation, pooling layers, fully-connected layers and convolution layers.
According to certain embodiments, deep learning model 320 is pre-trained on a dataset specifically developed for a contextual labeling application (for example, determining contextual labels from image data). However, sufficiently training a deep neural network to avoid overfitting frequently requires very large datasets, and it may be impractical to develop a dataset for training an entire deep neural network from scratch. Accordingly, in some other embodiments, it may be desirable to instead pre-train deep learning model 320 on a very large standard dataset (for example, ImageNet, which contains 1.2 million images with 1000 categories, or the Places dataset, which contains 500,000 images with 205 categories) and use the resulting model either as an initialization or a fixed feature extractor to build a final model. According to still other embodiments, deep learning model 320 is trained on a dataset generated by crowd worker devices (for example, crowd workers 225a-225c in FIG. 2).
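As a non-limiting illustration of using a pretrained model as a fixed feature extractor, consider the following minimal Python sketch using the PyTorch and torchvision libraries; the choice of AlexNet pretrained on ImageNet, the input size, and all names are illustrative assumptions rather than requirements of the embodiments described above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on ImageNet (the source domain).
base = models.alexnet(pretrained=True).eval()

# Drop the final 1000-way classification layer; the remaining layers
# act as a fixed feature extractor producing 4096-dimensional features
# that can initialize or feed a smaller application-specific model.
extractor = nn.Sequential(
    base.features,
    base.avgpool,
    nn.Flatten(),
    *list(base.classifier.children())[:-1],
)

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # stand-in for a captured image
    features = extractor(image)           # shape: (1, 4096)
```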
In some embodiments according to this disclosure, second layer 310 comprises an adaptation layer.
According to some embodiments, the classifiers for each deep learning model in first layer 305 belong to a problem space (also referred to as a “source domain”) specific to the model. Depending on embodiments, the source domains of the pre-trained models in first layer 305 may define a different classifier space from the space defined by the contextual labels to be output by pipeline 300 (also referred to as a “target domain”). For example, first layer 305 may include a pre-trained deep learning model which outputs 1000 class labels, of which only a subset are applicable to identifying a context of interest (for example, a location). In such cases, model adaptation layer 310 adapts the classifier space of the deep learning model to the target domain.
According to certain embodiments, domain adaptation comprises label space adaptation, wherein the output of the final layer of a pretrained deep learning model (for example, final layer 325d shown in FIG. 3) is adapted to the label space of the target domain.
According to one non-limiting example, final layer 325d of pretrained deep learning model 320 implements a SoftMax classifier. The operation of final layer 325d can, according to certain embodiments, be represented by the following SoftMax function:

P(y=j|Xi) = exp(wj·Xi) / Σn∈N exp(wn·Xi)

In the equation above, Xi is the feature vector extracted by the deep neural network for input sample i (a single captured image), and wj is the weight vector learned by the neural network for class j. y is the predicted class label, with j ∈ N, where N is the set of all the class labels a pre-trained model is trained on (the source domain).
In the non-limiting example of FIG. 3, adaptation layer 310 adapts the output of final layer 325d to the target domain by applying the Bayesian chain rule to restrict and renormalize the SoftMax probabilities, as shown below:

Ps(y=j|Xi) = P(y=j|Xi)·1(j∈L) / Σl∈N P(y=l|Xi)·1(l∈L)

As shown above, 1(·) is the indicator function and L is the label set of the application the pre-trained model is adapted for (the target domain). The denominator is the normalization factor, and thus Ps(y=j|Xi) indicates the probability of class (label) j given the feature vector Xi for application-specific labels j ∈ L. Thus, for a pre-trained model with label space N in the source domain, the above approach adapts the model for a target application with label space L ⊂ N.
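A minimal Python sketch of this label space adaptation is shown below; the label names and probabilities are invented for illustration, and the Boolean mask plays the role of the indicator function 1(·) in the equation above.

```python
import numpy as np

# Hypothetical SoftMax output of a pretrained model over its source
# domain N (label names and probabilities invented for illustration).
source_labels = ["cafe", "office", "supermarket", "beach", "forest"]
p_source = np.array([0.40, 0.10, 0.30, 0.15, 0.05])   # P(y=j|Xi), j in N

# Target domain L, a subset of N.
target_labels = {"cafe", "office", "supermarket"}
mask = np.array([lbl in target_labels for lbl in source_labels])  # 1(j in L)

# Restrict to L and renormalize, yielding Ps(y=j|Xi).
p_target = (p_source * mask) / np.sum(p_source * mask)

for lbl, p in zip(source_labels, p_target):
    if lbl in target_labels:
        print(f"Ps(y={lbl}|Xi) = {p:.3f}")
```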
According to some other embodiments, the differences between the space of the source domain and the target domain are such that model adaptation following the Bayesian chain rule may not be possible. Examples of such embodiments include embodiments where the target domain is not completely covered by the source domain (e.g., where L ⊄ N). One example of an embodiment where L ⊄ N is the case where there are class labels in L which do not have any representation in the source domain (for example, where the label “computer store” is in the space L of the target domain, but there is no corresponding label for “computer store” in the source domain). A further example of an embodiment where L ⊄ N includes instances where there are class labels in the space L of the target domain which match with multiple class labels in the source domain. For example, in some embodiments, L includes the contextual location label “shoe-shop.” In cases where the AlexNet-ImageNet model defines the source domain N, there are multiple corresponding labels (for example, “shoe,” “loafer shoe” and “sport shoe”).
In certain embodiments, adaptation layer 310 addresses situations where L ⊄ N by performing transfer learning. According to some embodiments, the feature-extracting, or convolutional, layers (for example, layer 325b) of pretrained deep learning model 320 are kept “frozen” by setting the learning rate for those layers to zero. Additionally, the last fully connected layer of the model is initialized with random weights and then trained on an additional data set labeled according to classifiers in the space L. According to such embodiments, the previously trained feature extractors of the pretrained model are retained, while the final fully connected layers extend the model to cover the entirety of space L. According to some embodiments, such model extension allows for training a deep learning model using a limited amount of training data while, at the same time, avoiding overfitting. According to certain embodiments, besides the learning rate and the structure of output layer 325d, other network hyper-parameters are taken from the base model 320. In one exemplary embodiment, a Rectified Linear Unit (ReLU) function is used as the activation function in each convolution layer, interleaved with pooling layers.
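A minimal sketch of this transfer learning procedure, assuming PyTorch and an AlexNet base pretrained on ImageNet, is shown below; freezing parameters via requires_grad is the practical equivalent of a zero learning rate, and the 26-class target label space is used purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)   # base model (source domain N)

# "Freeze" the feature-extracting convolutional layers; parameters with
# requires_grad=False receive no updates, equivalent to a zero learning rate.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the last fully connected layer (1000 source classes) with a
# randomly initialized layer covering the target label space L
# (26 classes here, for illustration).
in_features = model.classifier[6].in_features
model.classifier[6] = nn.Linear(in_features, 26)

# Optimize only the parameters that remain trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.001, momentum=0.9)
```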
According to certain embodiments, third layer 315 of pipeline 300 comprises an ensemble output layer. The accuracy with which pipeline 300 predicts contextual labels for input data can, in certain embodiments, be enhanced by performing ensemble modeling of the outputs of the deep learning models in first layer 305, as adapted by adaptation layer 310. In some embodiments, performing ensemble modeling comprises determining a weighted average of the prediction probabilities of each of the pretrained models in first layer 305, as adapted or extended in second layer 310.
According to certain other embodiments, third layer 315 of pipeline 300 further comprises aggregating the ensemble output from one or more pretrained deep learning models. For example, in certain embodiments, instead of running a single image from a location k through pipeline 300, a set of images from location k (represented as Γk) are sent through labeling pipeline 300. In such embodiments, the predictions for each image i in Γk may be aggregated by applying the function:

PΓ(y=l|Γk) = (1/|Γk|)·Σi∈Γk PI(y=l|Xik)

Wherein PI(y=l|Xik) is the prediction probability produced by the ensemble of deep neural network models for classifying an image i obtained (for example, by a crowd-worker device) at location k, represented by feature vector Xik. As noted above, Γk is the set of all images obtained at location k, and PΓ(y=l|Γk) is an aggregated prediction of the probability of label y=l across all images for location k.
According to certain embodiments, once the aggregated probability of each label y across all images in Γk has been determined, a final label for each location k may be obtained by selecting the label with the maximum aggregated probability. In one exemplary embodiment, the final label “labelk” may be selected by applying the function:

labelk = argmaxl∈L PΓ(y=l|Γk)
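The ensemble, per-location aggregation, and final label selection can be sketched together in Python as follows; the number of models, the ensemble weights, and the adapted per-image probabilities are invented, and a simple mean is used as the aggregation, consistent with the equations above.

```python
import numpy as np

labels = ["cafe", "supermarket", "shoe-store"]      # target domain L

# Adapted prediction probabilities: 3 models x 5 images from location k
# x |L| labels (values invented for illustration).
rng = np.random.default_rng(0)
model_preds = rng.dirichlet(np.ones(len(labels)), size=(3, 5))

# Ensemble output: weighted average across models -> PI(y=l|Xik) per image.
model_weights = np.array([0.5, 0.3, 0.2])           # illustrative weights
p_ensemble = np.tensordot(model_weights, model_preds, axes=1)  # (5, |L|)

# Aggregate across all images in Gamma_k (simple mean) -> PGamma(y=l|Gamma_k).
p_location = p_ensemble.mean(axis=0)

# Final label for location k: the maximum aggregated probability.
label_k = labels[int(np.argmax(p_location))]
print(label_k, np.round(p_location, 3))
```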
As discussed elsewhere in this disclosure, method 400 may be practiced across a variety of apparatus and networking contexts. In some embodiments, operations of method 400 may be practiced on a single chip in a single apparatus. According to other embodiments, the operations of method 400 may be practiced across multiple machines, such as a network of machines embodying a server-client paradigm.
In the non-limiting example of FIG. 4, method 400 comprises operations 405 through 435.
According to certain embodiments, method 400 includes operation 405, wherein an apparatus (as one non-limiting example, apparatus 100 in FIG. 1) receives one or more pretrained deep learning models, each of which is associated with a source domain.
In some embodiments, method 400 includes operation 410, wherein the apparatus receives one or more items of image data to be contextually labeled. In some embodiments, the image data may be received from a single source (for example, a camera on the apparatus implementing the processing pipeline). In other embodiments, the image data may be received from a plurality of sources (for example, crowd workers 225a-225c in FIG. 2).
In the non-limiting example of FIG. 4, method 400 includes operation 415, wherein the apparatus inputs the received image data to each of the one or more pretrained deep learning models.
According to some embodiments, method 400 includes operation 420, wherein the apparatus performs adaptation on each of the pretrained deep learning models. According to some embodiments, such as where the label space L of the target domain is a subset of the label space N of the source domain (e.g., where L ⊂ N), adaptation may, as discussed elsewhere in this disclosure, be performed by applying a Boolean filter to disregard labels in N which are not in L. In some embodiments, adaptation may be performed at operation 420 by applying the function:

Ps(y=j|Xi) = P(y=j|Xi)·1(j∈L) / Σl∈N P(y=l|Xi)·1(l∈L)
In at least one embodiment, L ⊄ N, and at operation 425, the apparatus performs transfer learning on each of the pretrained deep learning models which does not satisfy the condition L ⊂ N. According to certain embodiments, performing transfer learning comprises “freezing” the feature-extracting, or convolutional, layers of the pretrained deep learning model 320 by setting the learning rate for those layers to zero. Additionally, the last fully connected layer of the model is initialized with random weights and then trained on an additional data set labeled according to classifiers in the space L. According to such embodiments, the previously trained feature extractors of the pretrained model are retained, while the final fully connected layers extend the model to cover the entirety of space L.
In certain embodiments, method 400 includes operation 430 wherein a result in the target domain is obtained from each of the adapted pretrained deep learning models.
According to some embodiments, at operation 435, the apparatus provides an ensemble output, wherein the ensemble output comprises a weighted average of prediction probabilities for labels in the target domain based on the outputs from each of the adapted deep learning models.
According to certain embodiments, method 500 includes operation 505, wherein an apparatus receives, from a user terminal, a control input associated with an intent. In certain embodiments, the apparatus is the user terminal. In other embodiments, the apparatus may be a different computing platform, such as, for example, a back-end server connected via a network (for example, the internet) to the user terminal. The control input may be received as a typed command, a gesture, or a verbal utterance associated with an execution option which can be performed by a processor of the user terminal. The intent associated with the control input may, in some embodiments, be understood, or better understood, with some awareness of the context of the user terminal. As one non-limiting example, a person may provide as a user input to her user terminal the spoken query “What's the status of my order?” The user may also have installed a number of applications on her user terminal which provide order-related execution options (for example, a retail shopping application, and an application associated with a coffee shop). In this example, the probability of selecting the execution option (for example, opening the retail application or opening the coffee shop application) associated with the control input is improved with an awareness of the context of the user terminal. If the user terminal can determine that its current location is likely a coffee shop, the coffee shop application is likely the correct execution option.
In some embodiments, at operation 510, the apparatus obtains location data associated with the location of the user terminal. Location data includes, without limitation, image data, network connectivity data (for example, an identification of active Wi-Fi hotspots at a given location) and GPS data. In this particular example, the term “location of the user terminal” encompasses the physical location of a user terminal (for example, a smartphone) as a source of contextual signals as to the user's intent. As discussed above, the knowledge that the user's smartphone is in a coffee shop provides a contextual signal that opening the coffee shop application is likely the execution option best aligned with the intent of the user's control input. However, the present disclosure is not so limited, and method 500 is equally operable where the “location of the user terminal” is a location in virtual, abstract or other non-terrestrial space. For example, a “location of the user terminal” may, for example, refer to the user's location in a graph of a social network. In such cases, the user's location in the graph of the social network may provide contextual signals as to the execution option best matching the intent of the user's control input. For example, if the user input is a spoken request to “Play me a good movie,” knowledge of the user's neighbors in a graph of a social network may provide context in selecting a movie to play on the user terminal.
In various embodiments, at operation 515, the apparatus determines a scored set of execution options associated with the control input. Returning to the example of an apparatus receiving “What's the status of my order?” as a control input, at operation 515, the apparatus determines “open coffee shop application” and “open retail shopping application” as members of the scored set of execution options associated with the control input.
In the non-limiting example of FIG. 5, method 500 includes operation 520, wherein the apparatus obtains a contextual label associated with the location data, the label determined based on the application of one or more adapted pretrained deep learning models to the location data.
Because deep learning models can be both computationally expensive to run and demanding of storage, implementing such models on smartphones or other portable apparatus without rapidly draining batteries or consuming storage resources required for important user content (for example, photos, video and audio data) can present a technical challenge. In certain embodiments, the demands on the limited resources of mobile apparatus can be mitigated by “shrinking” the file size of the model by applying a quantization method which takes advantage of the weights format of a trained model. Such “shrinking” can be attained by, for example, quantizing each 32-bit floating-point value in a model's weight matrices to the closest 8-bit integer, resulting in an approximately 75% reduction in file size.
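A minimal Python sketch of such weight quantization is shown below; the affine (scale-and-offset) mapping is one common scheme and is an assumption for illustration, not necessarily the exact quantization method contemplated above.

```python
import numpy as np

def quantize(w: np.ndarray):
    """Map 32-bit float weights to 8-bit integers plus a scale and offset."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # guard against constant weights
    q = np.round((w - w_min) / scale).astype(np.uint8)   # 4 bytes -> 1 byte
    return q, scale, w_min

def dequantize(q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale + w_min

w = np.random.randn(256, 256).astype(np.float32)     # a toy weight matrix
q, scale, w_min = quantize(w)
print(w.nbytes, "->", q.nbytes, "bytes")             # ~75% reduction
print("max error:", float(np.abs(w - dequantize(q, scale, w_min)).max()))
```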
According to certain embodiments, at operation 525, the apparatus rescores the execution options in the set of execution options based on the contextual label obtained in operation 520. In one non-limiting example, the contextual label associated with the location data was “coffee shop,” thereby raising the score for the execution option “open coffee shop application.”
In the non-limiting example of FIG. 5, at operation 530, the apparatus provides the highest-scored execution option to a processor of the user terminal.
In one particular example, the performance of embodiments according to this disclosure was evaluated using eight pretrained deep learning models (“Models 1-8”), summarized in a table of model architectures and training data sets.
As shown in the column titled “Architecture,” Models 1-8 varied structurally, with regard to the number of convolution and fully connected layers, and with regard to the number of elements in the source domain, which ranged from 9 classes for “Model 8” to 1183 classes for “Model 3.” In this particular example, the target domain comprised 26 labels corresponding to, inter alia, different retail contexts (for example, “shoe-store” and “supermarket”).
As shown in the column titled “Data Set,” Models 1-8 were generally trained on large data sets having more classes than the 26 labels in the target domain. For models whose source domain, N, did not cover all of the 26 classes in the target domain, L, adaptation by transfer learning was performed, as shown by the “→” icon in the “Data Set” column.
According to certain embodiments, method 800 includes operation 805, wherein an apparatus (for example, apparatus 100 in FIG. 1) parses image data obtained at a location.
In some embodiments, parsing image data at operation 805 comprises identifying features corresponding to devices (for example, legs, bases and other constituent parts of devices and objects in the scene). From the identified features, a vector representing devices and objects in the scene (for example, chairs, toasters and TV screens) can be compiled.
In the non-limiting example of FIG. 8, at operation 810, the parsed image data is input to one or more convolutional neural networks, each of which is associated with a source domain.
In some embodiments, if the target domain is not completely covered by the source domain associated with one or more convolutional neural networks, transfer learning is performed at operation 815.
According to various embodiments, at operation 820, the deep learning models (in this non-limiting example, the convolutional neural networks) provide an output in the target domain. Depending on embodiments, the output provided at operation 820 may be an output from a single item of image data. In other embodiments, the output provided at operation 820 may be an aggregated output based on multiple pieces of image data from the location.
In some embodiments, the output in the target domain includes location attention vector 825. In the non-limiting example of FIG. 8, upward diagonal cross-hatching is used to show the entries of location attention vector 825 associated with the contextual location label “supermarket.”
According to certain embodiments at operation 830, a bipartite graph mapping locations to devices is updated based on the output of operation 820. In some embodiments, the bipartite graph comprises a mapping of the edges between members of a set of contextual location labels and devices. Further, according to some embodiments, each edge between labels and devices is assigned a weighting based on the determined correspondence between the contextual label and the device. Thus, at operation 830, the edge weightings of the bipartite graph are updated based on the output of operation 820.
In the non-limiting example of FIG. 8, at operation 835, the apparatus generates a location-device matrix from the updated bipartite graph.
According to some embodiments, the output of operation 835 comprises location-device matrix 840. As with location attention vector 825, upward diagonal cross-hatching is used to show entries of location-device matrix 840 associated with the contextual location label “supermarket.”
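One way to picture the bipartite graph and its conversion to a location-device matrix is sketched below in Python; the labels, devices, edge weightings, and the moving-average style update rule are all invented for illustration.

```python
import numpy as np

locations = ["supermarket", "coffee shop"]        # contextual location labels
devices = ["shopping cart", "espresso machine", "tv screen"]

# Edge weightings of the bipartite graph: correspondence between each
# contextual label and each device (values invented for illustration).
edges = {("supermarket", "shopping cart"): 0.9,
         ("coffee shop", "espresso machine"): 0.8}

def update_edge(edges, label, device, observation, rate=0.1):
    """Nudge an edge weighting toward a newly observed correspondence."""
    old = edges.get((label, device), 0.0)
    edges[(label, device)] = (1.0 - rate) * old + rate * observation

# Operation 830: update the graph based on the model output.
update_edge(edges, "supermarket", "tv screen", 0.3)

# Operation 835: convert the graph to a location-device matrix
# (rows: contextual labels; columns: devices).
M = np.zeros((len(locations), len(devices)))
for (label, device), weight in edges.items():
    M[locations.index(label), devices.index(device)] = weight
print(M)
```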
In the non-limiting example of FIG. 10, an initial determination by an intelligent assistant is rescored based on contextual information obtained according to embodiments of this disclosure.
According to certain embodiments, initial intelligent assistant determination 1005 comprises a vector of probabilities mapped to execution options of devices at a user's location, as initially determined by an intelligent assistant (for example, AI service 220) based on the intelligent assistant's analysis of the user's intention from a control input, without regard to any contextual labeling.
As a simple example, a user provides the spoken input “What's the status of my order?” The possible execution options associated with the user input include “open coffee shop application” and “open retail shopping application.” According to some embodiments, initial intelligent assistant determination 1005 may comprise a vector having values of 0.3 for each of the execution options “open coffee shop application” and “open retail shopping application.” In this example, the values of the vector are probabilistic scores as to whether an execution option correctly aligns with the user's intention. In this particular example, the apparatus is able to determine that both of these options are likely, but does not have any contextual information to prefer the coffee shop application over the retail shopping application.
In some embodiments, attention vector 1010 comprises an ensemble output of a labeling pipeline (for example, pipeline 300 in FIG. 3), providing probabilities of contextual location labels for the location of the user terminal.
According to various embodiments, location-intention matrix 1015 comprises a matrix of probabilities of execution options (or user intentions) across contextual location labels.
In the non-limiting example of FIG. 10, location-device matrix 1020 comprises a mapping of probabilities that particular devices are present at locations associated with contextual location labels (for example, location-device matrix 840 in FIG. 8).
According to certain embodiments, device capability matrix 1025 comprises a mapping of probabilities that a device obtained by parsing image data for a location i is associated with a particular execution option or user intent.
According to certain embodiments, the scores comprising the vector u of initial intelligent assistant determination 1005 can be recalculated, or “rescored,” by applying the function R(u) shown below:
R(u)=[Q.conc(a)]T·MT·YT·u
Wherein MT is the transpose of location-device matrix 1020, [Q.conc(a)]T is the transpose of the dot product of user-intention location matrix Q and the concatenation of location attention vector a (1010), and YT is the transpose of device-capability matrix 1025.
Application of R(u) results in the vector 1030, which is a vector in the same space as initial intelligent assistant determination 1005, but in which the constituent probabilities of user intentions are rescored based on contextual information provided by, without limitation, location-device matrix 1020.
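A minimal numerical sketch of the rescoring computation is shown below in Python; the dimensions and all matrix values are invented, and because the row/column conventions of Q, M and Y are not fixed above, the orientations (and hence the placement of transposes) are assumptions chosen to make the product in R(u) well-defined, with conc(a) read as broadcasting the attention vector across the rows of Q.

```python
import numpy as np

n_loc, n_dev, n_int = 3, 4, 2    # locations, devices, intents (illustrative)
rng = np.random.default_rng(1)

u = np.array([0.3, 0.3])         # initial scores, e.g. coffee vs. retail app
a = np.array([0.7, 0.2, 0.1])    # location attention vector (ensemble output)
Q = rng.random((n_loc, n_int))   # location-intention matrix
M = rng.random((n_loc, n_dev))   # location-device matrix
Y = rng.random((n_dev, n_int))   # device-capability matrix

# Q.conc(a): weight each location row of Q by its attention probability
# (assumed reading of the concatenation/dot product in the formula above).
QA = Q * a[:, None]              # shape (n_loc, n_int)

# R(u): chain the factors so the result lands back in the intent space;
# with the orientations chosen here this is QA^T · M · Y · u.
R_u = QA.T @ M @ Y @ u           # rescored intent vector, shape (n_int,)

print("rescored intents:", np.round(R_u / R_u.sum(), 3))
```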
None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claim scope. The scope of patented subject matter is defined only by the claims. Moreover, none of the claims is intended to invoke 35 U.S.C. § 112(f) unless the exact words “means for” are followed by a participle.
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/471,000 filed on Mar. 14, 2017. The above-identified provisional patent application is hereby incorporated by reference in its entirety.