Documents formatted in a portable document format (PDF) are used to simplify the display and printing of structured documents. These PDF documents permit incorporation of text and graphics in a manner that provides consistency in the display of documents across heterogeneous computing environments. In addition, it is often necessary to extract text and/or other information from a document encoded as a PDF to perform various operations. For example, text and location information can be extracted to determine an entity associated with the document. To optimize such tasks, existing tools (e.g., natural language models) focus on a single region of the document, which ignores inter-region information and provides sub-optimal results when extracting information from other regions. In addition, multiple models may be required to extract information from multiple regions, leading to increased cost and maintenance.
Embodiments described herein are directed to determining information from a PDF document based at least in part on relationships and other data extracted from a plurality of granularities of the PDF document. As such, the present technology is directed towards generating and using a multi-modal multi-granular model to analyze various document regions of different granularities or sizes. To accomplish the multi-granular aspect, the machine learning model analyzes components of a document at different granularities (e.g., page, region, token, etc.) by generating an input to the model that includes features extracted from the different granularities. For example, the input to the multi-modal multi-granular model includes a fixed-length feature vector containing features and bounding box information extracted from the page level, region level, and token level of the document. With regard to the multi-modal aspect, a machine learning model analyzes different types of features (e.g., textual features, visual features, and/or other features) associated with the document. As one example, the machine learning model analyzes visual features obtained from a convolutional neural network (CNN) and textual features obtained using optical character recognition (OCR), transforming such features first based on self-attention weights (e.g., within a single modality or type of feature) and then based on cross-attention weights (e.g., between modalities or types of features). These transformed feature vectors can then be provided to other machine learning models to perform various tasks (e.g., document classification, entity recognition, token recognition, etc.).
The multi-modal multi-granular model provides a single machine learning model that produces optimal results used for performing subsequent tasks, thereby reducing the training and maintenance costs required for the machine learning models that perform these subsequent tasks. For example, the multi-modal multi-granular model is used with a plurality of different classifiers, thereby reducing the need to train and maintain separate models. Furthermore, the multi-modal multi-granular model is also capable of detecting and/or obtaining context information or other information across regions and/or levels of the document. For example, based at least in part on the multi-modal multi-granular model processing inputs at multiple levels and/or regions of the document, the multi-modal multi-granular model determines a parent-child relationship between distinct regions of the document.
Using conventional approaches, it is generally inefficient and inaccurate for a single machine learning model to extract or otherwise determine information from a document. In many cases, these models are trained using only a single level or granularity (e.g., page, region, token) of a document and therefore are inefficient and inaccurate when determining information at a granularity other than the granularity at which the model was trained. In some examples, an entity recognition model is trained on data extracted from a region granularity of a document and is inefficient and inaccurate when extracting information from a page granularity or token granularity and, therefore, provides suboptimal results when information is included at other granularities. In addition, these conventional models are trained and operated in a single modality. In various examples, a model trained on tokens that comprise characters and words (e.g., a first modality) is ineffective at extracting information from images (e.g., a second modality).
Furthermore, training these conventional models based on a single granularity prevents the models from determining or otherwise extracting information between and/or relating different granularities. For example, conventional models are unable to determine relationships between granularities such as parent-child relationships, relationships between elements of a form, relationships between lists of elements, and other relationships within granularities and/or across granularities. Based on these deficiencies, it may be difficult to extract certain types of information from documents. In addition, conventional approaches may require the creation, training, maintenance, and upkeep of a plurality of models to perform various tasks. Creation, training, maintenance, and upkeep of multiple models consumes a significant amount of computing resources.
Accordingly, embodiments of the present technology are directed towards generating and using a multi-modal multi-granular model to analyze document regions of multiple sizes (e.g., granularities) and generate data (e.g., feature vectors) suitable for use in performing multiple tasks. For example, the multi-modal multi-granular model can be used in connection with one or more other machine learning models to perform various tasks such as page-level document extraction, region-level entity recognition, and/or token-level token classification. The multi-modal multi-granular model takes as an input features extracted from a plurality of regions and/or granularities of the document—such as document, page, region, paragraph, sentence, and word granularities—and outputs transformed features that can be used, for example, by a classifier or other machine learning model to perform a task. In an example, the input includes textual features (e.g., tokens, letters, numbers, words, etc.), image features, and bounding boxes representing regions and/or tokens from a document (e.g., page, paragraph, character, word, feature, image, etc.).
In this regard, an input generator of the multi-modal multi-granular tool, for example, generates a semantic feature vector and a visual feature vector which are in turn used as inputs to a uni-modal encoder (e.g., of the multi-modal multi-granular model) which transforms the semantic feature vector and the visual feature vector. As described in greater detail below, the transformed semantic feature vector and visual feature vector are then provided as an input to a cross-modal encoder of the multi-modal multi-granular model to generate attention weights (e.g., self-attention and cross-attention) associated with the semantic features and visual features. In various examples, the information generated by the multi-modal multi-granular model (e.g., the feature vectors including the attention weights) can be provided to various classifiers to perform various tasks (e.g., such as document classification, entity recognition, token recognition, etc.). As described above, conventional technologies typically focus on a single region of the document, thereby providing sub-optimal results when extracting information from another region and/or determining information across regions.
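As a rough illustration only, the flow described above can be sketched as follows; the object and method names (input_generator, uni_modal_encoder, cross_modal_encoder, task_classifier, and their methods) are hypothetical placeholders for the components discussed in this description, not an actual implementation.

```python
def process_document(document, input_generator, uni_modal_encoder,
                     cross_modal_encoder, task_classifier):
    # 1. Build multi-granular, multi-modal inputs (semantic "S" and visual "V").
    semantic_vec, visual_vec = input_generator.generate(document)

    # 2. Transform each modality with self-attention (within a single modality).
    semantic_vec, visual_vec = uni_modal_encoder.encode(semantic_vec, visual_vec)

    # 3. Exchange information across modalities with cross-attention.
    output_features = cross_modal_encoder.encode(semantic_vec, visual_vec)

    # 4. Hand the transformed features to a downstream task model
    #    (e.g., document classification, entity recognition, token labeling).
    return task_classifier.predict(output_features)
```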
As described above, for example, the multi-modal multi-granular model receives inputs generated based on regions of multiple granularities (e.g., whole-page, paragraphs, tables, lists, form components, images, words, and/or tokens). In addition, in various embodiments, the multi-modal multi-granular model represents alignments between regions that interact spatially through a self-attention alignment bias and learns multi-granular alignment through an alignment loss function. In various embodiments, the multi-modal multi-granular model includes multi-granular input embeddings (e.g., input embedding across multiple granularities generated by the input generator as illustrated in
In various embodiments, document extraction is performed by at least analyzing regions of different sizes within the document. Furthermore, by analyzing regions of different sizes within the document, the multi-modal multi-granular model, for example, can be used to perform relation extraction (e.g., parent-child relationships in forms, key-value relationships in semi-structured documents like invoices and forms), entity recognition (e.g., detecting paragraphs for decomposition), and/or sequence labeling (e.g., extracting dates in contracts) by at least analyzing regions of various sizes including an entire page as well as individual words and characters. In some examples, document classification analyzes the whole page, relation extraction and entity recognition analyze regions of various sizes, and sequence labeling analyzes individual words.
The multi-modal multi-granular model, advantageously, generates data that can be used to perform multiple distinct tasks (e.g., entity recognition, document classification, etc.) at multiple granularities, which reduces model storage and maintenance costs and improves performance over conventional systems as a result of the model obtaining information from regions at different granularities. In one example, the multi-modal multi-granular model obtains information from a table of itemized costs (e.g., coarse granularity) when looking for a total value (e.g., fine granularity) in an invoice or receipt. In other examples, tasks require data from multiple granularities—such as determining parent-child relationships in a document (e.g., checkboxes in a multi-choice checkbox group in a form), which requires looking at the parent region and child region at different granularities. As described in greater detail below in connection with
Advantageously, the multi-modal multi-granular model provides a single model that, when used with other models, provides optimal results for a plurality of tasks, thereby reducing the training and maintenance costs required for the models to perform these tasks separately. In other words, the multi-modal multi-granular model provides a single model that generates an optimized input to other models to perform tasks associated with the document, thereby reducing the need to maintain multiple models. Furthermore, the multi-modal multi-granular model is also capable of detecting and/or obtaining context information or other information across regions and/or levels of the document. This context information or other information across regions and/or levels of the document is generally unavailable to conventional models that take as an input features extracted from a single granularity.
Turning to
It should be understood that operating environment 100 shown in
It should be understood that any number of devices, servers, and other components may be employed within operating environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment.
User device 102 can be any type of computing device capable of being operated by an entity (e.g., individual or organization) associated with a document 120 from which information is to be extracted and/or one or more tasks are to be performed (e.g., entity recognition, document classification, sequence labeling, etc.). The user device 102, in various embodiments, has access to or otherwise maintains documents (e.g., the document 120) from which information is to be extracted. In some implementations, user device 102 is the type of computing device described in relation to
The user device 102 can include one or more processors, and one or more computer-readable media. The computer-readable media may include computer-readable instructions executable by the one or more processors. The instructions may be embodied by one or more applications, such as application 108 shown in
The application(s) may generally be any application capable of facilitating the exchange of information between the user device 102 and the multi-modal multi-granular tool 104 in carrying out one or more tasks that include information extracted from the document 120. In some implementations, the application(s) comprises a web application, which can run in a web browser, and could be hosted at least partially on the server-side of environment 100. In addition, or instead, the application(s) can comprise a dedicated application, such as an application being supported by the user device 102 and the multi-modal multi-granular tool 104. In some cases, the application is integrated into the operating system (e.g., as a service). It is therefore contemplated herein that “application” be interpreted broadly. Some example applications include ADOBE® SIGN, a cloud-based e-signature service, and ADOBE ACROBAT®, which allows users to view, create, manipulate, print, and manage documents.
In accordance with embodiments herein, the application 108 facilitates the generation of an output 122 of a multi-modal multi-granular model 126 that can be used to perform various tasks associated with the document 120. For example, user device 102 may provide the document 120 and indicate one or more tasks to be performed by a second machine learning model based on the document 120. In various embodiments, the second machine learning model includes various classifiers as described in greater detail below. Although, in some embodiments, a user device 102 may provide the document 120, embodiments described herein are not limited thereto. For example, in some cases, an indication of various tasks that can be performed on the document 120 may be provided via the user device 102 and, in such cases, the multi-modal multi-granular tool 104 may obtain the document 120 from another data source (e.g., a data store).
The multi-modal multi-granular tool 104 is generally configured to generate the output 122 which can be used by one or more task models 112, as described in greater detail below, to perform various tasks associated with the document 120. For example, as illustrated in
In various embodiments, the input generator 124 provides the generated input to the multi-modal multi-granular model 126 and, based on the generated input, the multi-modal multi-granular model 126 generates the output 122. As described in greater detail in connection with
For cloud-based implementations, the application 108 may be utilized to interface with the functionality implemented by the multi-modal multi-granular tool 104. In some cases, the components, or portion thereof, of multi-modal multi-granular tool 104 may be implemented on the user device 102 or other systems or devices. Thus, it should be appreciated that the multi-modal multi-granular tool 104 may be provided via multiple devices arranged in a distributed environment that collectively provide the functionality described herein. Additionally, other components not shown may also be included within the distributed environment.
Turning to
In various embodiments, bounding boxes, features, and other information are extracted from a document and provided to the input generator 224, which generates two input feature vectors (e.g., fixed-length feature vectors): a first feature vector corresponding to textual contents of the document (illustrated with an “S”) and a second feature vector corresponding to visual contents of the document (illustrated with a “V”). For example, at the page level 204, region level 206, and/or word level 208, data from the document (e.g., a page of the document) is extracted and the textual content is provided to a sentence encoder to generate the corresponding semantic feature vector for the particular granularity from which the data was extracted. Furthermore, in such an example, a CNN or other model generates a visual feature vector based at least in part on data extracted from the particular granularity. In various embodiments, the same models and/or encoders are used to generate input feature vectors for the page level 204, the region level 206, and the word level 208. In other embodiments, different models and/or encoders can be used for one or more granularities (e.g., the page level 204, the region level 206, and the word level 208). Furthermore, the data extracted from the document, in an embodiment, is modified by the input generator 224 during generation of the semantic feature vector (“S”) and the visual feature vector (“V”). In one example, a CNN suggests bounding boxes that are discarded by the input generator 224. In another example, as described in greater detail below in connection with
In an embodiment, the textual contents and bounding boxes of regions and tokens (e.g., words) of the document are obtained from one or more other applications. In addition, in various examples, regions refer to larger areas in the page which contain several words. Furthermore, the bounding boxes, in an embodiment, include rectangles enclosing an area of the document (e.g., surrounding a token, region, word, character, page, etc.) represented by coordinate values for the top-left and bottom-right corners of the bounding box. In such embodiments, these coordinates are normalized by the height and width of the page and rounded to an integer value. In some embodiments (e.g., where memory may be limited), a sliding window is used to select tokens, such that the tokens are in a cluster and can provide contextual information.
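A minimal sketch of the bounding-box normalization and sliding-window token selection described above follows; the 0 to 1000 coordinate scale, the window size, and the stride are illustrative assumptions, since this description only states that coordinates are normalized by the page size and rounded to integers and that tokens are selected in clusters.

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """bbox = (x0, y0, x1, y1) with (x0, y0) the top-left and (x1, y1) the
    bottom-right corner; coordinates are normalized by page size, scaled,
    and rounded to integers."""
    x0, y0, x1, y1 = bbox
    return (round(x0 / page_width * scale),
            round(y0 / page_height * scale),
            round(x1 / page_width * scale),
            round(y1 / page_height * scale))


def sliding_window(tokens, window_size=512, stride=256):
    """Yield overlapping clusters of neighboring tokens so each window
    retains local context (tail handling is omitted for brevity)."""
    if len(tokens) <= window_size:
        yield tokens
        return
    for start in range(0, len(tokens) - window_size + 1, stride):
        yield tokens[start:start + window_size]
```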
Once the input generator 224 has generated the input feature vectors, in various embodiments, the feature vectors are provided to a uni-modal encoder 210 and transformed, encoded, or otherwise modified to generate output feature vectors. In one example, self-attention weights are calculated for the input feature vectors based on features within a single modality. In an example, the self-attention weights include a value that represents an amount of influence features within a single modality have on other features (e.g., influence when processed by one or more task models). In various embodiments, the self-attention is calculated based on the following formula:
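A plausible form of this formula, assuming standard scaled dot-product self-attention with the alignment bias and relative distance bias added to the attention logits (the projection matrices W_Q, W_K, W_V and the scaling term sqrt(d) are assumptions rather than values given in this description), is:

\[ \mathrm{SelfAttn}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}} + A + R\right) XW_V \]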
where X represents the features of a single modality (e.g., semantic or visual features), A represents an alignment bias matrix 218, and R represents a relative distance bias matrix containing values calculated based at least in part on the distance between the bounding boxes of the features. In an embodiment, the alignment bias matrix 218 provides an indication that a particular word, token, and/or feature is within a particular region (e.g., page, region, sentence, paragraph, word, etc.). In the example illustrated in
Although the relationship between the token (e.g., “W1”) and the region (e.g., “R1”) is described as “within” in connection with
In an embodiment, the uni-modal encoder 210 adds or otherwise combines the self-attention weights, the alignment bias matrix 218, and the relative distance between features to transform (e.g., modify the features based at least in part on values associated with the self-attention weights, alignment bias, and relative distance) the set of features (e.g., represented by “S” and “V” in
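The combination described above can be illustrated with a small numerical sketch, assuming scaled dot-product attention; the projection matrices Wq, Wk, and Wv and the softmax normalization are assumptions rather than details given in this description.

```python
import numpy as np

def biased_self_attention(X, A, R, Wq, Wk, Wv):
    """Self-attention within one modality with the alignment bias A and the
    relative distance bias R added to the attention logits.

    X: (n, d_in) features of a single modality (semantic or visual)
    A: (n, n) alignment bias matrix
    R: (n, n) relative distance bias matrix
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[-1]) + A + R   # combine the three terms
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # transformed features
```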
In various embodiments, the output of the uni-modal encoder 210 is provided to a cross-modal encoder 212 which determines cross-attention values between and/or across modalities. In one example, the cross-attention values for the semantic feature vectors are determined based on visual features (e.g., values included in the visual feature vector). In various embodiments, the cross-attention values are determined based on the following equations:
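A plausible form of these equations, consistent with the description of cross-attention computed from the dot product of multi-modal features (the projection matrices and the scaling term sqrt(d) are assumptions), is:

\[ \mathrm{Feat}_S = \mathrm{softmax}\!\left(\frac{(SW_Q)(VW_K)^{\top}}{\sqrt{d}}\right) VW_V, \qquad \mathrm{Feat}_V = \mathrm{softmax}\!\left(\frac{(VW_Q)(SW_K)^{\top}}{\sqrt{d}}\right) SW_V \]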
where S represents a semantic feature and V represents a visual feature, and the two features (e.g., FeatS and FeatV) are concatenated to generate the output feature included in the output feature vector. In an embodiment, the cross-attention values are calculated based on the dot product of multi-modal features (e.g., semantic and visual features). Furthermore, in various embodiments, the output of the cross-modal encoder 212 is a set of feature vectors (e.g., output feature vectors which are the output of the multi-modal multi-granular model 226) including transformed features, the transformed features corresponding to a granularity of the document (e.g., page, region, word, etc.). In an embodiment, the output of the cross-modal encoder 212 is provided to one or more machine learning models to perform one or more tasks as described above. For example, the semantic feature vector for the word-level granularity is provided to a machine learning model to label the features (e.g., words extracted from the document). In various embodiments, the set of input feature vectors generated by the input generator 224 is provided as an input to the uni-modal encoder 210; the uni-modal encoder 210 modifies the set of input feature vectors (e.g., modifies the values included in the feature vectors) to generate an output; and the output of the uni-modal encoder 210 is provided as an input to the cross-modal encoder 212, which then modifies the output of the uni-modal encoder 210 (e.g., the set of feature vectors) to generate an output (e.g., the output set of feature vectors).
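A minimal numerical sketch of this cross-attention and concatenation step follows, under the same assumptions as the self-attention sketch above; sharing one set of projection matrices between the two directions and concatenating along the feature dimension (which assumes the semantic and visual streams have the same length) are simplifications made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_features(S, V, Wq, Wk, Wv):
    """Cross-attention between modalities: semantic features attend over
    visual features and vice versa, and the two results are concatenated
    to form the output features."""
    d = Wk.shape[-1]
    feat_s = softmax((S @ Wq) @ (V @ Wk).T / np.sqrt(d)) @ (V @ Wv)
    feat_v = softmax((V @ Wq) @ (S @ Wk).T / np.sqrt(d)) @ (S @ Wv)
    return np.concatenate([feat_s, feat_v], axis=-1)
```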
In various embodiments, during a pre-training phase, various pre-training operations are performed using the output 222 of the multi-modal multi-granular model or components thereof (e.g., cross-modal encoder 212). In one example, a masked sentence model (MSM), masked vision model (MVM), and/or a masked language model (MLM) are used to perform pre-training operations. In addition, the pre-training operations, in various embodiments, include a multi-granular alignment model (MAM) to train the multi-modal multi-granular model to use the alignment information (e.g., the alignment bias matrix 218) based on a loss function. For example, an alignment loss function can be used to penalize the multi-modal multi-granular model and reinforce the multi-modal multi-granular model's use of the alignment relation. In various embodiments, as described in greater detail below in connection with
In regard to
Turning to
Furthermore, a computing device, in various embodiments, communicates with other computing devices via a network (not shown in
In various embodiments, a multi-modal multi-granular model generates and/or extracts data from the document 320 at one or more regions (e.g., granularities) of the document 320. In one example, the multi-modal multi-granular model generates a set of feature vectors used by one or more task machine learning models to perform document classification based on data obtained from the document 320 at a plurality of granularity levels (e.g., the page-level 302 granularity). As described in greater detail below in connection with
In an embodiment, an OCR model, CNN, and/or other machine learning model generates a set of input feature vectors based at least in part on the document 320; the set of input feature vectors is processed by the multi-modal multi-granular model and then provided, as a set of output feature vectors (e.g., the result of the multi-modal multi-granular model processing the set of input feature vectors), to a document classification model to perform the document classification task. Similarly, when performing relation extraction tasks, the multi-modal multi-granular model generates a modified set of feature vectors (e.g., the set of output feature vectors) which are then used by one or more additional task models to extract relationships between regions and/or other granularities (e.g., words, pages, etc.). In the example illustrated in
Turning to
Turning now to
Furthermore, in the example illustrated in
As illustrated in
In various embodiments, the position embedding 526 includes information indicating the position of the feature relative to other features in the document. In one example, features are assigned a position value (e.g., 0, 1, 2, 3, 4, . . . as illustrated in
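The sequential position assignment described above can be sketched as follows; the table size and embedding dimension are illustrative assumptions, and the randomly initialized table is a stand-in for what would, in practice, be a learned parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
position_table = rng.normal(size=(512, 64))   # up to 512 positions, 64-dim embeddings

def position_embeddings(num_features):
    """Assign sequential position values 0, 1, 2, 3, 4, ... to the features
    and look up one embedding vector per position."""
    positions = np.arange(num_features)
    return position_table[positions]
```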
In addition, in the example illustrated in
In an embodiment, the alignment bias 618 is represented as a matrix where a first set of dimensions (e.g., rows or columns) represents portions and/or regions of the document across granularities (e.g., page, region, words) and a second set of dimensions represents features (e.g., tokens, words, image features, etc.). In such embodiments, the value V0 is assigned to a position in the matrix if the feature A corresponding to the position is within (∈) the region B corresponding to the position. Furthermore, in such embodiments, the value V1 is assigned to a position in the matrix if the feature A corresponding to the position is not within (∉) the region B corresponding to the position.
In various embodiments, during transformation of the input using attention weights, the alignment bias 618 enables the multi-modal multi-granular model to encode relationships between features and/or regions. In addition, as described below in connection with
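The expression referenced here is presumably the same biased attention form sketched earlier; a plausible version (again assuming standard scaled dot-product attention with projection matrices W_Q and W_K) is:

\[ \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}} + A + R\right) \]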
where A represents the alignment bias 618 and R represents the relative distance bias 614. In one example, to generate the alignment bias 618, the bounding boxes corresponding to regions are compared to bounding boxes corresponding to features to determine if a relationship (e.g., ∈) is satisfied. In various embodiments, if the relationship is satisfied (e.g., the word X is in the region Y), a value is added to the corresponding attention weight between the region and the feature. In such embodiments, the value added to the attention weight is determined such that the multi-modal multi-granular model can be trained based at least in part on the relationship.
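A minimal sketch of constructing such an alignment bias from bounding boxes follows; the specific values v0 and v1 and the containment test used here are illustrative assumptions.

```python
import numpy as np

def contains(region_box, feature_box):
    """True if feature_box lies inside region_box; boxes are (x0, y0, x1, y1)
    with top-left and bottom-right corners."""
    rx0, ry0, rx1, ry1 = region_box
    fx0, fy0, fx1, fy1 = feature_box
    return rx0 <= fx0 and ry0 <= fy0 and fx1 <= rx1 and fy1 <= ry1

def alignment_bias(region_boxes, feature_boxes, v0=1.0, v1=0.0):
    """Assign v0 where a feature is within a region (the ∈ relationship)
    and v1 otherwise."""
    A = np.full((len(region_boxes), len(feature_boxes)), v1)
    for i, region_box in enumerate(region_boxes):
        for j, feature_box in enumerate(feature_boxes):
            if contains(region_box, feature_box):
                A[i, j] = v0
    return A
```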
In an embodiment, the relative distance bias 614 represents the distance between regions and features. In one example, relative distance bias 614 is calculated based at least in part on the distance between bounding boxes (e.g., calculated based at least in part on the coordinates of the bounding boxes). In various embodiments, the relative distance bias 614 (e.g., the value calculated as the distance between bounding boxes) is added to the attention weights 610 to strengthen the spatial expression. For example, the attention weights 610 (including the alignment bias 618 and the relative distance bias 614) indicate to the multi-modal multi-granular model how much attention features should assign to other features (e.g., based at least in part on feature type, relationship, location, etc.). In various embodiments, the multi-modal multi-granular model includes a plurality of alignment biases representing various distinct relationships (e.g., inside, outside, above, below, right, left, etc.). In addition, in such embodiments, the plurality of alignment biases can be included in separate instances of the multi-modal multi-granular model executed in serial or in parallel.
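One way the relative distance bias could be derived from bounding-box coordinates is sketched below; the use of center-point Euclidean distance and the negative sign are assumptions, as the description only states that the bias is calculated from the distance between bounding boxes.

```python
import numpy as np

def relative_distance_bias(boxes_a, boxes_b):
    """Bias derived from distances between bounding-box centers; closer
    boxes receive a larger (less negative) bias when added to the
    attention weights."""
    centers_a = np.array([[(x0 + x1) / 2, (y0 + y1) / 2] for x0, y0, x1, y1 in boxes_a])
    centers_b = np.array([[(x0 + x1) / 2, (y0 + y1) / 2] for x0, y0, x1, y1 in boxes_b])
    distances = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)
    return -distances
```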
In various embodiments, documents include a plurality of regions within different granularity levels as described above. In one example, a highest granularity level includes a page 704 of the document, a medium granularity level includes a region 706 of the document (e.g., a portion of the document less than a page), and a lowest granularity level includes a token 708 within the document (e.g., a word, character, image, etc.). The pre-training MSM task includes, in various embodiments, calculating the loss (e.g., L1 loss function) between the corresponding region output features and the original textual features. In yet other embodiments, the MSM pre-training task is performed using visual features extracted from the set of documents.
In an embodiment, the pre-training tasks include a multi-granular alignment model (MAM) to train the multi-modal multi-granular model 702 to use the alignment information included in the alignment bias 718. In one example, an alignment loss function is used to reinforce the multi-modal multi-granular model 702's representation of the relationship indicated by the alignment bias 718. In an embodiment, the dot product 712 between regions and tokens included in the output (e.g., feature vector) of the multi-modal multi-granular model 702 is calculated and binary classification is performed to predict alignment. In various embodiments, the loss function includes calculating the cross entropy 710 between the dot product 712 and the alignment bias 718. In the MAM pre-training task, for example, a self-supervision task is provided to the multi-modal multi-granular model 702, where the multi-modal multi-granular model 702 is rewarded for identifying relationships across granularities and penalized for not identifying relationships (e.g., as indicated in the alignment bias 718).
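A minimal sketch of this alignment loss follows, assuming the pairwise dot products are passed through a sigmoid and scored against the binary alignment targets with cross entropy; the sigmoid and the averaging over pairs are assumptions made for illustration.

```python
import numpy as np

def alignment_loss(region_feats, token_feats, alignment_targets):
    """Dot products between region and token output features are treated as
    logits for a binary 'token is aligned with region' prediction and
    compared against the alignment targets (0 or 1) with cross entropy."""
    logits = region_feats @ token_feats.T                # pairwise dot products
    probs = 1.0 / (1.0 + np.exp(-logits))                # sigmoid -> alignment probability
    eps = 1e-9
    cross_entropy = -(alignment_targets * np.log(probs + eps)
                      + (1 - alignment_targets) * np.log(1 - probs + eps))
    return cross_entropy.mean()
```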
In various embodiments, the multi-modal multi-granular model 702 is pre-trained and initialized with weights based on a training dataset (e.g., millions of training sample documents) and then used to process additional datasets to label the data and adapt the weights specifically for a particular task. In yet other embodiments, the weights are not modified after pre-training/training. Another pre-training task, in an embodiment, includes a masked language model (MLM). In one example, the MLM masks a portion of the words in the input and predicts the missing words using the semantic output features obtained from the multi-modal multi-granular model 702.
In an example, a model can perform an analytics task which involves classifying a page 804 into various categories to obtain statistics for a collection analysis. In another example, the analytics task includes inferring a label about the page 804, region 806, and/or word 808. Another task includes information extraction to obtain a single value. In embodiments including information extraction, the multi-modal multi-granular model 802 provides a benefit by at least modeling multiple granularities, enabling the model performing the task to use contextual information from coarser or finer levels of granularity to extract the information.
In an embodiment, the output of the multi-modal multi-granular model 802 is used by a model to perform form field grouping which involves associating widgets and labels into checkbox form fields, multiple checkbox fields into choice groups, and/or classifying choice groups as single- or multi-select. Similarly, in embodiments including form field grouping, the multi-modal multi-granular model 802 provides a benefit by including relationship information in the output. In other embodiments, the task performed includes document re-layout (e.g., reflow) where complex documents such as forms have nested hierarchical layouts. In such examples, the multi-modal multi-granular model 802 enables a model to reflow documents (or perform other layout modification/editing tasks) based at least in part on the granularity information (e.g., hierarchical grouping of all elements of a document) included in the output.
Turning now to
At block 904, the system executing the method 900 modifies the feature vector based on a set of self-attention values. In an example, semantic features (e.g., features included in the feature vector) extracted from the document are transformed based on attention weights calculated based at least in part on other semantic features (e.g., included in the feature vector). In various embodiments, the self-attention values are calculated using the formula described above in connection with
At block 906, the system executing the method 900 modifies the feature vector based on a set of cross-attention values. In an example, semantic features (e.g., features included in the feature vector) extracted from the document are transformed based at least in part on attention weights calculated based at least in part on other feature types (e.g., visual features included in a visual feature vector). In various embodiments, the cross-attention values are calculated using the formula described above in connection with
At block 908, the system executing the method 900 provides the modified feature vectors to a model to perform a task. For example, as described above in connection with
Turning now to
At block 1004, the system executing the method 1000 trains the multi-modal multi-granular model. In various embodiments, training the multi-modal multi-granular model includes providing the multi-modal multi-granular model with a set of training data objects (e.g., documents) for processing. For example, the multi-modal multi-granular model is provided a set of documents including features extracted at a plurality of granularities for processing.
Having described embodiments of the present invention,
Computing device 1100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 1100 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 1100. Computer storage media does not comprise signals per se. Communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 1112 includes computer storage media in the form of volatile and/or nonvolatile memory. As depicted, memory 1112 includes instructions 1124. Instructions 1124, when executed by processor(s) 1114, are configured to cause the computing device to perform any of the operations described herein, in reference to the above discussed figures, or to implement any program modules described herein. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 1100 includes one or more processors that read data from various entities such as memory 1112 or I/O components 1120. Presentation component(s) 1116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 1118 allow computing device 1100 to be logically coupled to other devices including I/O components 1120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc. I/O components 1120 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on computing device 1100. Computing device 1100 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, computing device 1100 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of computing device 1100 to render immersive augmented reality or virtual reality.
Embodiments presented herein have been described in relation to particular embodiments which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present disclosure pertains without departing from its scope.
Various aspects of the illustrative embodiments have been described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to one skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well-known features have been omitted or simplified in order not to obscure the illustrative embodiments.
Various operations have been described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Further, descriptions of operations as separate operations should not be construed as requiring that the operations be necessarily performed independently and/or by separate entities. Descriptions of entities and/or modules as separate modules should likewise not be construed as requiring that the modules be separate and/or perform separate operations. In various embodiments, illustrated and/or described operations, entities, data, and/or modules may be merged, broken into further sub-parts, and/or omitted.
The phrase “in one embodiment” or “in an embodiment” is used repeatedly. The phrase generally does not refer to the same embodiment; however, it may. The terms “comprising,” “having,” and “including” are synonymous, unless the context dictates otherwise. The phrase “A/B” means “A or B.” The phrase “A and/or B” means “(A), (B), or (A and B).” The phrase “at least one of A, B and C” means “(A), (B), (C), (A and B), (A and C), (B and C) or (A, B and C).”