Embodiments generally relate to object recognition transformers. More particularly, embodiments relate to a semantic-guided transformer for object recognition and radiance field-based novel views.
Three-dimensional (3D) object recognition and radiance-field-based (RF-based) novel view synthesis are two active research areas in the 3D feature representation domain. The 3D object recognition task involves classifying objects situated in 3D space, which may be accomplished by analyzing either a set of images or point cloud data. The multiview-based approach, in particular, entails recognizing 3D objects by integrating information gathered from numerous two-dimensional (2D) views acquired from diverse perspectives.
Separately, radiance-field-based novel view synthesis is an intricate technique that captures and reconstructs the visual appearance of objects or scenes by employing a process that involves capturing images from varying viewpoints and subsequently approximating the radiance field (a mathematical representation that describes the visual appearance of an object as a function of viewing direction). More generally, the field of radiance field (RF) based novel view synthesis is utilized to generate novel views of complex 3D scenes based on a partial set of 2D images.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
As discussed above, different solutions have been proposed for addressing specific challenges of the field of three-dimensional (3D) object recognition as compared to the separate field of radiance field (RF) based novel view synthesis.
In the field of 3D object recognition, some approaches have proposed generative models for unsupervised identification of 3D shape structures, incorporated additional human-provided information, or developed weakly supervised techniques. Annotating additional data by humans, however, is costly, and the improvements achieved by these methods are limited.
Concerning RF-based novel view synthesis, the focus is often on the generalization ability due to robustness concerns and the proposal of cross-scene rendering. Some approaches rely on scalable external data to improve generalization ability. These approaches, however, require massive training data, and there is considerable room for improvement with respect to the learning efficiency of these models.
As will be described in greater detail below, a semantic-guided transformer based on a deep learning model that uses self-attention to process sequential input data (e.g., via a neural network) is provided for both object recognition and radiance-field-based novel view synthesis. The semantic-guided transformer for object recognition and radiance-field-based novel view synthesis is also referred to herein as "2R-TRM" (where "2R" encompasses 3D object [R]ecognition and [R]adiance-field-based novel view synthesis). The semantic-guided transformer for object recognition and radiance-field-based novel view synthesis (2R-TRM) unifies these two tasks by integrating object semantic information into visual features from multiple viewpoints. This integration enhances the learning of latent features and underlying patterns. Additionally, a semantic understanding module is utilized to guide feature learning through the incorporation of noise contrastive estimation, in some examples.
The technology described herein integrates two formerly separate research tasks: 3D object recognition and radiance-field-based novel view synthesis, which aim to classify and represent 3D objects, respectively, based on visual information from multiple viewpoints. Interest in this integration is driven by two factors. First, the techniques described herein can generate a 360-degree view video and provide semantic labels for commercial purposes. Second, both tasks involve understanding and recognizing 3D object materials, shapes, and structures from multiple viewpoints. The challenges in these tasks include accounting for scene/environment properties, such as lighting conditions, and addressing deficiencies such as cross-scene/object generalization and learning efficiency in radiance field representation. The specific problem of interest is how to effectively integrate these two tasks and leverage their mutual benefits, where radiance-field-based (RF-based) novel view synthesis enhances 3D object recognition by capturing crucial object details, while 3D object recognition provides semantic knowledge to aid radiance-field-based view synthesis learning.
The integration of these two tasks leads to mutual benefit. On the one hand, RF-based novel view synthesis yields superior 3D object representations, capturing crucial details pertaining to texture, shape, and structure of objects, thereby facilitating the model's ability to differentiate among various object categories. On the other hand, 3D object recognition endows RF-based novel view synthesis with semantic knowledge. As the semantic label embodies an abstract understanding and generalization of the object, combining the recognition task provides guidance to RF-based novel view synthesis learning, consequently enhancing model efficiency. In addition, the combination of two tasks realizes an application that can simultaneously generate 360-degree rendered video and recognize 3D objects. Further, involving object label prediction makes it possible to retrieve stock keeping unit (SKU) information.
Advantageously, by integrating the tasks of 3D object recognition and RF-based novel view synthesis, the semantic-guided transformer for object recognition and radiance-field-based novel view synthesis (2R-TRM) provides mutual benefits to both domains. The transformer (e.g., also referred to herein as a model) enhances the learning of latent features and underlying patterns, resulting in improved performance in 3D object recognition and RF-based novel view synthesis. The inclusion of semantic information further enhances the efficiency of the model.
Moreover, the combination of these tasks allows for the simultaneous generation of 360-degree rendered videos and 3D object recognition. This integration brings benefits by enabling the retrieval of stock keeping unit (SKU) information. This capability becomes particularly relevant in the context of augmented reality (AR) and virtual reality (VR) devices, where accurate and real-time object recognition coupled with immersive experiences is in high demand. The technology described herein therefore effectively bridges the gap between rendering video applications and object label prediction, providing practical applications.
As will be described in greater detail below, a specific structural feature of the techniques described herein is the latent features extracted from the aggregation encoder. As used herein, the term "latent features" refers to features learned through the joint task of integrating 3D semantic object recognition and RF-based novel view synthesis, in which semantic information from 3D semantic object recognition aids radiance field view synthesis rendering, and radiance field information from the radiance field view synthesis rendering of a 3D scene enhances the 3D semantic object recognition. In the machine learning domain, latent features are defined as the underlying, non-explicit characteristics or patterns in the data that a model (e.g., in this case, the aggregation encoder) discovers or leverages during its learning process. These patterns can manifest as similar visual structures across different views and different instances within the same category. The latent features capture the underlying patterns and relationships between the two modalities by incorporating semantic information from recognition to aid rendering (e.g., radiance field view synthesis operations) and incorporating radiance field features from the 3D scene to enhance the recognition task (e.g., semantic object recognition operations). These latent features enable improved performance and provide valuable insights into the interplay between object recognition and novel view synthesis.
In some examples, the semantic-guided transformer 10 unifies the 3D object recognition and radiance-field-based view-synthesis/representation training, incorporating object semantic information into visual features from multiple viewpoints. For example, multi-view visual data 11 (e.g., RGB (red, green, blue) multi-view images) are encoded and fused by an aggregation encoder 12 to produce latent features 13, while a label decoder 14 and a rendering decoder 16 are utilized to infer object labels 15 and synthesize novel target views 17 based on target view directions, respectively.
For example, the semantic-guided transformer 10 includes two stages: encoding visual view features into latent features via the aggregation encoder 12, and decoding them via the label decoder 14 to obtain object labels 15 and via the rendering decoder 16 to obtain novel target views 17. The aggregation encoder 12 processes multi-view visual data 11 of numerous two-dimensional (2D) views acquired from diverse perspectives to create latent features 13 (e.g., including a coordinate-aligned feature field). For example, the multi-view visual data is associated with visual view features. As used herein, the term "visual view features" refers to RGB (or the like) image features of view i 20 and corresponding camera projection matrices from multiple two dimensional views. The visual view features in view i (e.g., shown as Ii in
In some examples, the aggregation encoder 12 is responsible for aggregating the multi-view visual data 11 (e.g., source views) into latent features 13. For example, the aggregation encoder 12 aggregates different source views into a coordinate-aligned feature field.
In some implementations, the rendering decoder 16 then composes coordinate-aligned features from the latent features 13 along a target ray of a target view 22 to obtain the novel target views (e.g., including obtaining the color). For example, point-wise colors are mapped to token features, and weighted aggregation is performed by the rendering decoder 16 to produce the final output.
In some examples, simultaneously, the label decoder 14 further integrates latent features 13 and maps them into object labels 15 (e.g., object categories). For example, the label decoder 14 non-linearly maps the latent features 13 into the object labels 15 (e.g., object categories).
In some implementations, to enhance the attending of semantic features to latent features, a self-supervised semantic understanding module 21 is included in the training process. The semantic understanding module 21 promotes improved feature representation. During the training phase, the semantic understanding module 21 is responsible for encouraging better semantic guidance. For example, the semantic understanding module 21 further encourages semantic guidance in training by guiding the learning of features through the incorporation of noise contrastive estimation.
Overview
Details of the proposed semantic-guided transformer 10 for object recognition and radiance-field-based novel view synthesis (2R-TRM) are described in greater detail below. The main idea of the semantic-guided transformer 10 model is to attend object semantic information into the visual features of different views.
Source view images $\{I_i \in \mathbb{R}^{H \times W \times 3}\}_{i=0}^{N-1}$ and their corresponding camera projection matrices $\{P_i \in \mathbb{R}^{3 \times 4}\}_{i=0}^{N-1}$, indicating the poses from N different views, are fed into the aggregation encoder (A) to aggregate the different views into a coordinate-aligned feature field Z. Then, the latent features Z are fed into the rendering decoder ($D_r$) to obtain the rendered color c of camera ray r. Meanwhile, Z is decoded by the label decoder ($D_l$). To make semantic features better attend to latent features, the semantic understanding module (U) is designed to self-supervise the training process for a better feature representation.
Aggregation Encoder
As will be described in greater detail below, in some implementations, the aggregation encoder aggregates different source views into a coordinate-aligned feature field. More specifically, given N source view images $\{I_i \in \mathbb{R}^{H \times W \times 3}\}_{i=0}^{N-1}$ and their corresponding camera projection matrices $\{P_i \in \mathbb{R}^{3 \times 4}\}_{i=0}^{N-1}$, the goal of the aggregation encoder is to aggregate visual features extracted from the source views according to the camera and geometry priors and to output latent features that are used by later modules.
First, each source view $I_i$ is encoded into a feature map $f_i \in \mathbb{R}^{H \times W \times d_{view}}$ (i.e., a mapping $\mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^{H \times W \times d_{view}}$) using a U-Net based convolutional neural network, where $d_{view}$ is the corresponding feature dimension and i is the source view index. Compared with encoding an entire scene, one model disclosed herein interpolates the feature vector on the image plane. In addition, with the similar concern that involving every pixel of the source view features incurs a high memory cost, one model disclosed herein introduces epipolar geometry as an inductive bias, restricting each pixel to attend only to pixels that lie on the corresponding epipolar lines of neighboring views. Specifically, to obtain the feature at spatial coordinate $x \in \mathbb{R}^3$ and 2D viewing directions $(\theta, \phi) \in [-\pi, \pi]^2$ (in practice, the directions are expressed as a 3D Cartesian unit vector d), the coordinates x are projected into feature space, and bilinear interpolation is performed with nearby neighboring view features on the corresponding epipolar lines. These features are regarded as coordinate-aligned token features $\{e_i\}_{i=0}^{N-1}$ for the Transformer in the Aggregation Encoder:
$e_i(x, d) = X_i\big(R_i(x), d\big) \in \mathbb{R}^{d_{view}} \quad (1)$

where $\tilde{x} = R_i(x) \in \mathbb{R}^2$ is the projection from x to the i-th image plane by applying the extrinsic matrix, and $X_i(\tilde{x}, d)$ computes the feature vector at coordinate $\tilde{x}$ via interpolation.
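For illustration only, the following sketch (in PyTorch; the helper names project_to_view and sample_view_features are hypothetical and not part of any claimed embodiment) shows one way the coordinate-aligned token features of Equation (1) could be computed: a query point is projected into each source view with its projection matrix and the corresponding feature map is bilinearly interpolated. The epipolar-line restriction arises when this is repeated for all points along a ray; the viewing direction d is carried alongside for the later positional encoding.

```python
import torch
import torch.nn.functional as F

def project_to_view(x_world, proj_matrix):
    """Project 3D points (B, 3) to 2D pixel coordinates (B, 2) using a 3x4 projection matrix."""
    ones = torch.ones_like(x_world[:, :1])
    x_h = torch.cat([x_world, ones], dim=-1)          # homogeneous coordinates (B, 4)
    uvw = x_h @ proj_matrix.T                         # (B, 3)
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-8)   # perspective divide -> (u, v)

def sample_view_features(feature_map, uv, height, width):
    """Bilinearly interpolate a (d_view, H, W) feature map at pixel coordinates uv (B, 2)."""
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([uv[:, 0] / (width - 1), uv[:, 1] / (height - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, -1, 1, 2)
    feats = F.grid_sample(feature_map.unsqueeze(0), grid, align_corners=True)  # (1, d_view, B, 1)
    return feats.squeeze(0).squeeze(-1).T             # (B, d_view)

# Toy example: N source views, one query point x with unit viewing direction d
N, H, W, d_view = 4, 64, 64, 32
feature_maps = [torch.randn(d_view, H, W) for _ in range(N)]   # f_i from the U-Net encoder
proj_matrices = [torch.randn(3, 4) for _ in range(N)]          # P_i
x = torch.randn(1, 3)                                          # spatial coordinate
d = F.normalize(torch.randn(1, 3), dim=-1)                     # viewing direction (used later)

tokens = []
for f_i, P_i in zip(feature_maps, proj_matrices):
    uv = project_to_view(x, P_i)                # \tilde{x} = R_i(x)
    e_i = sample_view_features(f_i, uv, H, W)   # X_i(\tilde{x}, d) via bilinear interpolation
    tokens.append(e_i)
tokens = torch.stack(tokens, dim=1)             # coordinate-aligned token features (1, N, d_view)
```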
The positional embedding of the Transformer is obtained by a learnable positional encoder ($PE_A$) over the relative directions ($\Delta d$) of the source view to the target view:

$p = PE_A(\Delta d) \in \mathbb{R}^{d_{view}} \quad (2)$
The reason for involving relative directions instead of absolute directions is that it is preferred to obtain closer token indexes over similar views between the source and target, since a smaller difference between the target view and the source views typically implies a larger probability of the target view resembling the corresponding colors at the source views and vice versa.
Then, the tokens are concatenated with the positional information and fed into the Transformer architecture ($\mathrm{TRM}_A$), achieving view aggregation and outputting the coordinate-aligned latent features Z:

$Z(x, d) = \mathrm{TRM}_A\big(\{[e_i(x, d);\, p_i]\}_{i=0}^{N-1}\big) \quad (3)$
In summary, Z(x, d) represents the 3D latent feature for a single query point location x on a ray r with the unit-length viewing direction d (e.g., the 5D location: 3D spatial coordinate x, and 2D viewing direction θ, ϕ), and involves the view difference between the target view and the source view. These will be used for the rendering decoder by querying target view 5D locations, and the label decoder by further integrating the latent features.
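The following minimal sketch, assuming the token layout from the previous example, illustrates how Equations (2) and (3) could be realized with a learnable positional encoder over relative view directions and a standard Transformer encoder over the N view tokens. The class name AggregationEncoder, the projection back to the latent dimension, and the final mean over views are simplifications for a runnable example; in the embodiments the view weighting is learned by attention.

```python
import torch
import torch.nn as nn

class AggregationEncoder(nn.Module):
    """Sketch of Eqs. (2)-(3): positional encoding over relative directions plus a Transformer
    that fuses the N coordinate-aligned view tokens into a latent feature Z(x, d)."""
    def __init__(self, d_view=32, n_layers=3, n_heads=4):
        super().__init__()
        self.pos_encoder = nn.Sequential(nn.Linear(3, d_view), nn.ReLU(), nn.Linear(d_view, d_view))
        layer = nn.TransformerEncoderLayer(d_model=2 * d_view, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(2 * d_view, d_view)      # back to the latent feature dimension

    def forward(self, tokens, source_dirs, target_dir):
        # tokens: (B, N, d_view) coordinate-aligned features e_i(x, d)
        # source_dirs: (B, N, 3) and target_dir: (B, 1, 3) unit viewing directions
        delta_d = source_dirs - target_dir                     # relative directions, Eq. (2)
        p = self.pos_encoder(delta_d)                          # (B, N, d_view)
        z = self.transformer(torch.cat([tokens, p], dim=-1))   # concatenate token and position, Eq. (3)
        return self.proj(z).mean(dim=1)                        # aggregated latent feature Z(x, d)

# Usage with random data
enc = AggregationEncoder()
tokens = torch.randn(2, 4, 32)                 # B=2 query points, N=4 source views
source_dirs = torch.randn(2, 4, 3)
target_dir = torch.randn(2, 1, 3)
Z = enc(tokens, source_dirs, target_dir)       # (2, 32)
```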
Rendering Decoder
As will be described in greater detail below, in some implementations, the point-wise colors are mapped to token features in the Transformer, and weighted aggregation is performed by the rendering decoder to produce the final output. More specifically, generalizable 3D representations may be learnable from seen views to achieve novel view synthesis. The synthesis process can be regarded as a weighted aggregation of view features, and this process depends on the target view direction and its difference from the source view directions. In such a circumstance, the attention mechanism of the Transformer architecture is one solution for learning these correlations.
From Eq. 3, the target view 5D locations $(x_k, d)$ are injected to obtain the queried tokens $Z(x_k, d)$, where $x_k = o + t_k d$, partitioning the near and far bounds $[t_n, t_f]$ into M evenly spaced bins and randomly and uniformly sampling within each bin:

$t_k \sim \mathcal{U}\Big[t_n + \tfrac{k-1}{M}(t_f - t_n),\; t_n + \tfrac{k}{M}(t_f - t_n)\Big], \quad k = 1, \ldots, M \quad (4)$
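As an illustrative sketch of the sampling step described above (the function name stratified_sample and the toy bounds are hypothetical), the following code partitions the near and far bounds into M bins and draws one uniform sample per bin:

```python
import torch

def stratified_sample(o, d, t_near, t_far, M):
    """Sample M points x_k = o + t_k * d along a ray by partitioning [t_near, t_far]
    into M evenly spaced bins and drawing one uniform sample per bin."""
    bins = torch.linspace(t_near, t_far, M + 1)            # bin edges
    lower, upper = bins[:-1], bins[1:]
    t = lower + (upper - lower) * torch.rand(M)            # one uniform draw per bin
    x = o.unsqueeze(0) + t.unsqueeze(-1) * d.unsqueeze(0)  # (M, 3) query locations
    return x, t

o = torch.zeros(3)                  # ray origin
d = torch.tensor([0.0, 0.0, 1.0])   # unit ray direction
x_k, t_k = stratified_sample(o, d, t_near=2.0, t_far=6.0, M=64)
```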
To aggregate the features from all sampled points, the Transformer ($\mathrm{TRM}_{D_r}$) and its corresponding positional encoder ($PE_{D_r}$) are applied to the queried features because of the self-attention property of the Transformer. The reason for using target directions instead of relative directions is to separate tokens from different view directions with different positional indexing. The accumulated colors along the ray ($r = (o, d)$) are then obtained by weighted pooling over the output self-attention-attended tokens, and then mapping the result to RGB space via a few-layer MLP:
$c(r) = \mathrm{MLP}_{rgb}\big(\mathrm{mean}_{k=0}^{M-1}\, W_k \cdot h_k\big), \quad (5)$

$h_k = \mathrm{TRM}_{D_r}\big([Z(x_k, d);\, p]\big), \quad (6)$

$p = PE_{D_r}(d), \quad (7)$

where $\{W_k\}_{k=0}^{M-1}$ are the pooling weights.
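For illustration, a minimal sketch of Equations (5) through (7) in PyTorch is shown below; the module name RenderingDecoder, the softmax-normalized pooling weights, and the sigmoid output range are assumptions made to obtain a runnable example rather than details disclosed above.

```python
import torch
import torch.nn as nn

class RenderingDecoder(nn.Module):
    """Sketch of Eqs. (5)-(7): attend over the M point tokens queried along a target ray,
    pool them with learned weights, and map the pooled result to an RGB color."""
    def __init__(self, d_latent=64, n_layers=4, n_heads=4):
        super().__init__()
        self.pos_encoder = nn.Linear(3, d_latent)          # PE over the target direction d, Eq. (7)
        layer = nn.TransformerEncoderLayer(d_model=2 * d_latent, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.weight_head = nn.Linear(2 * d_latent, 1)      # produces the pooling weights W_k
        self.mlp_rgb = nn.Sequential(nn.Linear(2 * d_latent, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, Z, d):
        # Z: (B, M, d_latent) latent features Z(x_k, d); d: (B, 3) target view direction
        p = self.pos_encoder(d).unsqueeze(1).expand(-1, Z.shape[1], -1)
        h = self.transformer(torch.cat([Z, p], dim=-1))    # attended tokens h_k, Eq. (6)
        w = torch.softmax(self.weight_head(h), dim=1)      # pooling weights along the ray
        pooled = (w * h).sum(dim=1)                        # weighted pooling, Eq. (5)
        return torch.sigmoid(self.mlp_rgb(pooled))         # rendered color c(r) in [0, 1]

dec = RenderingDecoder()
Z = torch.randn(8, 64, 64)      # 8 rays, M=64 sampled points, d_latent=64
d = torch.randn(8, 3)
c = dec(Z, d)                   # (8, 3) RGB values
```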
The Rendering Decoder is supervised by minimizing the mean squared error between the rendered colors and the ground truth pixel colors in the training phase:

$\mathcal{L}_r = \frac{1}{|B|} \sum_{r \in B} \big\| c(r) - c_{gt}(r) \big\|_2^2, \quad (8)$

where B is the set of rays in each training batch.
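A minimal sketch of this supervision, assuming a batch of 1024 rays with random stand-in tensors, could be expressed as:

```python
import torch
import torch.nn.functional as F

rendered = torch.rand(1024, 3)       # c(r) for a batch B of rays
ground_truth = torch.rand(1024, 3)   # ground truth pixel colors c_gt(r)
loss_render = F.mse_loss(rendered, ground_truth)   # mean squared error over the batch, Eq. (8)
```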
Label Decoder
As will be described in greater detail below, in some implementations, the label decoder non-linearly maps the latent features into the object categories. More specifically, to represent the object category features, the features along each source ray are first integrated by weighted pooling (with weights $W_j$) of the latent features Z at all points in each ray direction. Then, an MLP is applied to project the aggregated ray features into another feature space with dimension $d_{obj}$. Formally, Eq. 9 illustrates how to obtain the ray-related token features $f_r$ from the 5D location-dependent features $Z(x, d)$:

$f_r = \mathrm{MLP}\big(\mathrm{mean}_{j=0}^{M-1}\, W_j \cdot Z(x_j, d)\big) \in \mathbb{R}^{d_{obj}} \quad (9)$
Similar to Eqs. 6 and 7, a Transformer ($\mathrm{TRM}_{D_c}$) and its positional encoder ($PE_{D_c}$) are used to integrate the ray-related token inputs of all source rays. Different from the previous Transformers, an extra learnable "classification token" is added to the sequence in order to perform classification.
$z = \mathrm{TRM}_{D_c}\big([f_r;\, p]\big), \quad (10)$

$p = PE_{D_c}(r), \quad (11)$
Finally, the classification results are generated from the output of $\mathrm{TRM}_{D_c}$:
$\hat{y} = \mathrm{FC}\big(\mathrm{MLP}_{cls}(\mathrm{LN}(z))\big), \quad (12)$

where FC(⋅) is the fully connected layer, LN(⋅) represents the layer normalization and $\mathrm{MLP}_{cls}$(⋅) denotes the MLP layers.
The Label Decoder is supervised by minimizing the cross-entropy loss.
$\mathcal{L}_l = \mathrm{CE}(\hat{y}, y_{gt}), \quad (13)$

where $y_{gt}$ represents the ground-truth label.
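The following sketch illustrates Equations (10) through (13) under stated simplifications: the module name LabelDecoder is hypothetical, the positional information is added to the tokens rather than concatenated, and a 40-class output is assumed (as in ModelNet40).

```python
import torch
import torch.nn as nn

class LabelDecoder(nn.Module):
    """Sketch of Eqs. (10)-(13): prepend a learnable classification token to the ray-related
    tokens f_r, run a Transformer, then LayerNorm + MLP + FC to predict the object label."""
    def __init__(self, d_obj=32, n_classes=40, n_layers=4, n_heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_obj))
        self.pos_encoder = nn.Linear(6, d_obj)             # PE over the ray r = (o, d)
        layer = nn.TransformerEncoderLayer(d_model=d_obj, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_obj)
        self.mlp_cls = nn.Sequential(nn.Linear(d_obj, d_obj), nn.ReLU())
        self.fc = nn.Linear(d_obj, n_classes)

    def forward(self, f_r, rays):
        # f_r: (B, R, d_obj) ray-related token features; rays: (B, R, 6) origin + direction per ray
        tokens = f_r + self.pos_encoder(rays)              # positional information (added here)
        tokens = torch.cat([self.cls_token.expand(f_r.shape[0], -1, -1), tokens], dim=1)
        z = self.transformer(tokens)[:, 0]                 # output at the classification token, Eq. (10)
        return self.fc(self.mlp_cls(self.norm(z)))         # logits \hat{y}, Eq. (12)

dec = LabelDecoder()
logits = dec(torch.randn(2, 128, 32), torch.randn(2, 128, 6))              # 2 objects, 128 rays each
loss_label = nn.functional.cross_entropy(logits, torch.tensor([3, 17]))   # Eq. (13)
```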
Semantic Understanding Module
As will be described in greater detail below, in some implementations, the semantic understanding module is introduced to further facilitate the semantic information that assists the 2R dual task. The objective of this module is to encourage the latent features from the aggregation encoder to maintain a proper distance from the object semantic features in the pre-training phase. Equation 14 below aims to learn a better distance between latent features and semantic features. More specifically, a distance is adjusted between the latent features and the object semantic features via the semantic understanding module. Specifically, Equation 14 encourages the learning of latent features such that there is a pronounced difference between the distances of the latent features to positive semantic features and to negative semantic features.
As described above, the latent features are learned from the multi-view visual data using an aggregation encoder. They capture the latent patterns of a category, such as its structural properties. The latent features are those learned from the current multi-view visual input of an object in category A.
As used herein, the term “semantic features” refers to the Global Vectors for Word Representation (GloVe) features associated with the category label, which can be a phrase or a word. GloVe is a method for obtaining vector representations (also known as embeddings) for words in natural language processing (NLP). These vector representations can capture semantic relationships between words based on their co-occurrence patterns in large corpora. To obtain GloVe word embeddings, one can either train their own embeddings on a specific corpus or use pre-trained embeddings.
The idea behind introducing positive and negative concepts with respect to the semantic features is to guide the learning of latent features. The goal is to ensure that the latent features maintain a suitable distance to the positive semantic features compared to the distance they have to the negative semantic features.
More specifically, Equation 14 uses two types of semantic features. The positive semantic features are the GloVe features corresponding to semantic label A. Conversely, the negative semantic features are the GloVe features for semantic label B. Category B is randomly selected from the dataset, provided it is distinct from category A. For example, each category (e.g., category A and category B) comprises multiple objects. The relationship between the semantic label, category, and semantic features can be illustrated as follows: category—class index 0, semantic label—the word "chair", semantic features—a vector representing the GloVe features of the word "chair".
A noise contrastive estimation (NCE) is employed over features to bring the latent features for each source view ray r closer to the object semantic features (s+) than the other features (s−) randomly sampled from the list. For a shorter notation, Z(r) is used to represent the mean pooling of hidden features along the ray r. The semantic features of the object are the GloVe embedding of the object name.
In this work, the softmax version of the NCE loss is used:

$\mathcal{L}_{nce} = -\sum_{r \in O} \log \frac{\exp\big(Z(r)^{\top} g(s^{+})\big)}{\exp\big(Z(r)^{\top} g(s^{+})\big) + \sum_{s^{-} \in N} \exp\big(Z(r)^{\top} g(s^{-})\big)}, \quad (14)$

where O represents all rays for the object view, N is the set of negative pairs (meaning that the semantic features do not match the current latent features and do not describe the current 3D object), and g(⋅) maps the GloVe features into the same $d_{view}$-dimensional vector space as Z.
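A minimal sketch of this softmax NCE objective, assuming 300-dimensional GloVe vectors, a learned linear mapping g into the latent space, and the hypothetical function name semantic_nce_loss, is shown below:

```python
import torch
import torch.nn.functional as F

def semantic_nce_loss(z_rays, s_pos, s_neg, g):
    """Softmax NCE over ray features (Eq. 14): pull the mean-pooled latent feature of each
    source ray toward the GloVe embedding of the correct label, away from negative labels."""
    # z_rays: (R, d_view) mean-pooled latent features Z(r) for all rays of one object view
    # s_pos: (d_glove,) GloVe vector of the object's label; s_neg: (K, d_glove) negative labels
    pos = g(s_pos).unsqueeze(0)                 # (1, d_view)
    neg = g(s_neg)                              # (K, d_view)
    logits = z_rays @ torch.cat([pos, neg]).T   # (R, 1 + K) similarity scores
    targets = torch.zeros(z_rays.shape[0], dtype=torch.long)   # positive is always index 0
    return F.cross_entropy(logits, targets)

g = torch.nn.Linear(300, 64)                    # maps 300-d GloVe features into the latent space
z_rays = torch.randn(128, 64)                   # Z(r) for 128 rays
s_pos = torch.randn(300)                        # e.g., GloVe("chair")
s_neg = torch.randn(5, 300)                     # 5 randomly sampled negative labels
loss_nce = semantic_nce_loss(z_rays, s_pos, s_neg, g)
```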
Implementation Details
In an experimental implementation described in greater detail below, the extraction of multi-view image features is accomplished through an architecture that resembles a U-Net. The aggregation encoder configuration was based on GNT. The scale of the Transformer may be reduced, with three layers and four attention heads. Both the rendering and label decoders possess a Transformer structure consisting of four layers and four attention heads. In some implementations, the dimension of the latent feature vector, Z, is 64, while all other hidden layer dimensions are 32. The semantic understanding module is established for the pre-training phase. For example, 30% of the training data is used as pre-training data, none of which is seen in the evaluation. In some examples, the related parameters are not frozen during the training phase; instead, they only function as semantic guidance for the model. For the experiments below, the negative sample number was set to 5. Distributed data-parallel training was performed on four NVIDIA A100 GPUs, with the learning rate set to e-4 using the Adam optimizer.
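For reference, the experimental settings described above can be consolidated into a single configuration object; the field names below are hypothetical, and the learning-rate coefficient is omitted because it is not fully specified above.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hypothetical consolidation of the experimental settings described above."""
    encoder_layers: int = 3          # aggregation encoder Transformer depth
    encoder_heads: int = 4
    decoder_layers: int = 4          # rendering and label decoder Transformer depth
    decoder_heads: int = 4
    d_latent: int = 64               # dimension of the latent feature vector Z
    d_hidden: int = 32               # all other hidden layer dimensions
    pretrain_fraction: float = 0.3   # share of training data reserved for semantic pre-training
    num_negatives: int = 5           # negative samples for the NCE loss
    optimizer: str = "adam"
    num_gpus: int = 4                # distributed data-parallel training

config = TrainConfig()
```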
Experiment Results
The results, as shown in Table 1, demonstrate that the 2R-TRM model disclosed herein outperforms the other methods in terms of accuracy in both sets. This indicates the superiority of the model disclosed herein in 3D object recognition under limited training data.
The model disclosed herein is compared against other approaches on the ModelNet40 dataset. The results highlight that the model disclosed herein outperforms the other methods that use additional data for pre-training or input a higher number of views. This demonstrates the superior learning efficiency of the model disclosed herein, particularly when dealing with limited data.
Additionally, experiments 16 and 19 provide evidence that contrastive learning on visual inputs plays a role in 3D object recognition, further emphasizing the effectiveness of the model disclosed herein.
The columns labeled “Label Decoder” and “Semantic Understanding Module” indicate which model was used in the 3D object recognition stream and whether the semantic understanding mechanism was employed during training, respectively. Experiments 4 and 6 demonstrate that the Transformer architecture outperforms other models in aggregating features from multiple views to predict the label. Additionally, experiments 6 and 7 illustrate the effectiveness of the contrastive learning mechanism in the semantic understanding module. Experiments 24 to 27 show that the benefits of semantic understanding extend beyond 3D object recognition, as such an approach also improves the performance of novel view rendering by facilitating the learning of generalization features.
Computer program code to carry out operations shown in the method 70 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
In some examples, the methods described herein may be performed at least in part by cloud processing. It will be appreciated that some or all of the operations described herein that have been described using a “pull” architecture (e.g., polling for new information followed by a corresponding response) may instead be implemented using a “push” architecture (e.g., sending such information when there is new information to report), and vice versa.
Illustrated processing block 72 provides for encoding multi-view visual data. For example, visual view features of multi-view visual data are encoded into latent features via an aggregator encoder.
For example, latent features are learned through the joint task of integrating 3D object recognition and radiance field view synthesis to incorporate semantic information from 3D object recognition to aid radiance field view synthesis rendering and incorporate radiance field information from radiance field view synthesis rendering of a 3D scene to enhance the 3D object recognition.
As discussed above, a specific structural feature of the techniques described herein is the latent features extracted from the aggregation encoder. These latent features are learned through the joint task of integrating 3D object recognition and RF-based novel view synthesis. These latent features capture the underlying patterns and relationships between the two modalities by incorporating semantic information from recognition to aid rendering and incorporating radiance field features from the 3D scene to enhance the recognition task. These latent features enable improved performance and provide valuable insights into the interplay between object recognition and novel view synthesis.
In some examples, the aggregation encoder 12 is responsible for aggregating the multi-view visual data 11 (e.g., source views) into latent features 13. For example, the aggregation encoder 12 aggregates different source views into a coordinate-aligned feature field.
Illustrated processing block 74 provides for decoding the latent features into one or more novel target views. For example, the latent features are decoded into one or more novel target views different from views of the multi-view visual data via a rendering decoder.
In some implementations, the rendering decoder composes coordinate-aligned features from the latent features along a target ray of a target view to obtain the novel target views (e.g., including obtaining the color). For example, point-wise colors are mapped to token features, and weighted aggregation is performed by the rendering decoder to produce the final output.
Illustrated processing block 76 provides for decoding the latent features into an object label. For example, the latent features are decoded into an object label via a label decoder.
In some implementations, the operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time.
For example, the label decoder and rendering decoder operate on a unified data source of latent features. Accordingly, the label decoder and rendering decoder are able to operate in parallel, and potentially simultaneously. For example, the combination of the two tasks realizes an application that can simultaneously generate 360-degree rendered video and recognize 3D objects.
In some examples, the label decoder and rendering decoder operate simultaneously on the same latent features. For example, the label decoder further integrates latent features and maps them into object categories. In some implementations, the label decoder non-linearly maps the latent features into the object categories.
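The following sketch (with simple stand-in heads rather than the full decoders; the tensor shapes and module names are hypothetical) illustrates the point that both outputs are produced from the same latent features and can therefore be computed concurrently:

```python
import torch
import torch.nn as nn

# Both heads read the same latent features Z, so the two outputs can be produced
# from one shared encoding pass (and dispatched to separate devices or streams).
d_latent, n_classes = 64, 40
latent = torch.randn(1, 128, d_latent)   # shared latent features from the aggregation encoder

rendering_head = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, 3))
label_head = nn.Sequential(nn.LayerNorm(d_latent), nn.Linear(d_latent, n_classes))

rendered_colors = torch.sigmoid(rendering_head(latent))   # per-point colors for novel-view rendering
label_logits = label_head(latent.mean(dim=1))             # object category logits from the same Z
```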
In operation, the method 70 therefore enhances performance at least to the extent that by integrating the tasks of 3D object recognition and RF-based novel view synthesis, the semantic-guided transformer for object recognition and radiance-field-based novel view synthesis (2R-TRM) provides mutual benefits to both domains. The integration of these two tasks leads to mutual benefit. On the one hand, RF-based novel view synthesis yields superior 3D object representations, capturing crucial details pertaining to texture, shape, and structure of objects, thereby facilitating the model's ability to differentiate among various object categories. On the other hand, 3D object recognition endows RF-based novel view synthesis with semantic knowledge. As the semantic label embodies an abstract understanding and generalization of the object, combining the recognition task provides guidance to RF-based novel view synthesis learning, consequently enhancing model efficiency.
Additional and/or alternative operations for method 70 are described in greater detail below in the description of
As illustrated, operations 82-88 are discussed as occurring during training of the semantic-guided transformer.
Illustrated processing block 82 provides for encoding the multi-view visual data into latent features. For example, the visual view features are encoded from the multi-view visual data into latent features via the aggregator encoder.
Illustrated processing block 86 provides for adjusting a distance between the latent features and semantic features. For example, a distance is adjusted between the latent features and the semantic features via a semantic understanding module.
Illustrated processing block 88 provides for encoding the multi-view visual data into latent features. For example, visual view features are encoded from the multi-view visual data into latent features based on the adjusted distance between the latent features and the semantic features via the aggregator encoder.
In operation, in some implementations, to enhance the attending of semantic features to latent features, a self-supervised semantic understanding module 21 is included in the training process. The semantic understanding module 21 promotes improved feature representation. During the training phase, the semantic understanding module 21 is responsible for encouraging better semantic guidance. For example, the semantic understanding module 21 further encourages semantic guidance in training by guiding the learning of features through the incorporation of noise contrastive estimation.
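A minimal sketch of one pre-training step is shown below, assuming random stand-in tensors and an assumed weighting of 0.1 for the NCE term (this weighting is not disclosed above); it illustrates how the semantic guidance enters the training objective alongside the rendering and label losses:

```python
import torch
import torch.nn.functional as F

# Stand-in tensors for one pre-training step
rendered, gt_colors = torch.rand(1024, 3), torch.rand(1024, 3)
logits, gt_label = torch.randn(1, 40), torch.tensor([7])
z_rays = torch.randn(128, 64)            # mean-pooled latent features per source ray
sem = torch.randn(6, 64)                 # projected GloVe features: index 0 positive, 1..5 negative

loss_render = F.mse_loss(rendered, gt_colors)                                 # Eq. (8)
loss_label = F.cross_entropy(logits, gt_label)                                # Eq. (13)
nce_logits = z_rays @ sem.T
loss_nce = F.cross_entropy(nce_logits, torch.zeros(128, dtype=torch.long))    # Eq. (14)
loss = loss_render + loss_label + 0.1 * loss_nce   # 0.1 is an assumed, not disclosed, weighting
```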
As illustrated, operations 92-96 may generally be incorporated into block 72 (
Illustrated processing block 92 provides for aggregating the multi-view visual data into a coordinate-aligned feature field. For example, the multi-view visual data is aggregated into a coordinate-aligned feature field via the aggregator encoder.
As used herein the term “coordinate-aligned feature field” refers to a 3D representation where every point or coordinate in that space is mapped to a unique feature descriptor or vector. This representation is constructed by converting multi-view images into a consistent and spatially aligned 3D grid of features. For example, the coordinate-aligned feature field provides a way to fuse information from multiple perspectives into a unified 3D feature space that corresponds to specific spatial coordinates.
In some examples, the multi-view visual data includes a plurality of red-green-blue images and a plurality of corresponding camera projection matrices. In some implementations, the multi-view visual data includes views of a plurality of different objects received together by the aggregator encoder.
Illustrated processing block 94 provides for performing semantic object recognition operations. For example, semantic object recognition operations are performed based on radiance field view synthesis operations via the aggregator encoder.
As used herein the term, “semantic object recognition operations” includes providing a semantic label (e.g., an object category) based on object recognition of physical objects. Such a semantic label embodies an abstract understanding and generalization of an object into an object category that semantically describes the object. In some examples, the label decoder integrates latent features and maps them into object labels (e.g., object categories). For example, the label decoder non-linearly maps the latent features into the object labels (e.g., object categories).
Separately, the term “radiance field view synthesis operations” refers to an intricate technique that captures and reconstructs the visual appearance of objects or scenes by employing a process that involves capturing images from varying view points and subsequently approximating the radiance field (e.g., a mathematical representation that describes the visual appearance of an object as a function of viewing direction). More generally, the radiance field view synthesis is utilized to generate novel views of complex 3D scenes based on a partial set of 2D images.
These two tasks are unified by integrating object semantic information into visual features from multiple viewpoints. This integration enhances the learning of latent features and underlying patterns. The technology described herein integrates two formerly separate research tasks: semantic object recognition operations (e.g., 3D object recognition) and radiance field view synthesis operations (e.g., radiance-field-based novel view synthesis), which aim to classify and represent 3D objects, respectively, based on visual information from multiple viewpoints. Both tasks involve understanding and recognizing 3D object materials, shapes, and structures from multiple viewpoints. The challenges in these tasks include accounting for scene/environment properties, such as lighting conditions, and addressing deficiencies such as cross-scene/object generalization and learning efficiency in radiance field representation. These two tasks are integrated to leverage their mutual benefits, where radiance field view synthesis operations enhance semantic object recognition operations by capturing crucial object details, while semantic object recognition operations provide semantic knowledge to aid radiance field view synthesis learning.
For example, the radiance field view synthesis operations yield superior semantic object recognition operations by capturing crucial details pertaining to texture, shape, and structure of objects, thereby facilitating the model's ability to differentiate among various object categories during semantic object recognition operations.
Illustrated processing block 96 provides for performing radiance field view synthesis operations based on semantic object recognition operations. For example, radiance field view synthesis operations are performed based on semantic object recognition operations via the aggregator encoder.
For example, semantic object recognition operations endow radiance field view synthesis operations with semantic knowledge. As the semantic label (e.g., object category) embodies an abstract understanding and generalization of the object, combining the recognition task provides guidance to radiance field view synthesis learning, consequently enhancing model efficiency.
As illustrated, operations 102-104 may generally be incorporated into block 74 (
Illustrated processing block 102 provides for obtaining a rendered color of a given camera ray as a point-wise color. For example, a rendered color of a given camera ray is obtained as a point-wise color via the rendering decoder.
Illustrated processing block 104 provides for mapping, via the rendering decoder, one or more token features to the point-wise color. For example, one or more token features are mapped to the point-wise color via the rendering decoder.
As used herein, “token features” is a general term referring to the representations or embeddings used in the Transformer architecture, associated with individual components. These components can include region features, pixel-wise features, or other elements in the input sequence.
As illustrated, operation 112 may generally be incorporated into block 76 (
Turning now to
In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM). In an embodiment, an IO (input/output) module 288 is coupled to the host processor 282. The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), mass storage 302 (e.g., hard disk drive/HDD, optical disc, solid state drive/SSD) and a network controller 292 (e.g., wired and/or wireless). The host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298.
In an embodiment, the host processor 282 and/or the AI accelerator 296 executes a set of program instructions 300 retrieved from the mass storage 302 and/or the system memory 286 to perform one or more aspects of the method 70 (
The instructions 300 may also be implemented in a distributed architecture (e.g., distributed in both location and over time). For example, the compacted encoding of multi-view visual data into latent features may occur on a separate first processor (not shown) at an earlier time than the execution of the transformer-based neural network decoding on the SoC 298 of the computing system 280 (e.g., a different separate remote second processor at a later time, independent of the earlier processing time). Furthermore, the results of a decoding operation may be stored on a different separate remote third processor (not shown), to be displayed to a human user at a later time, independent of earlier processing times. Thus, the computing system 280 may be understood as illustrating one of a plurality of devices, rather than a single device.
Accordingly, the various processing stages may be initiated based on network messages between distributed processors, using suitable networking protocols, as known to those skilled in the art. For example, the TCP/IP (Transmission Control Protocol/Internet Protocol) suite of protocols, among others. The storage and retrieval of pre-processing, intermediate, and final results may be stored in databases using SQL (Structured Query Language) or No-SQL programming interfaces, among others. The storage elements may be physically located at different places than the processing elements.
The computing system 280 is therefore considered performance-enhanced at least to the extent that by integrating the tasks of 3D object recognition and RF-based novel view synthesis, the semantic-guided transformer for object recognition and radiance-field-based novel view synthesis (2R-TRM) provides mutual benefits to both domains. The integration of these two tasks leads to mutual benefit. On the one hand, RF-based novel view synthesis yields superior 3D object representations, capturing crucial details pertaining to texture, shape, and structure of objects, thereby facilitating the model's ability to differentiate among various object categories. On the other hand, 3D object recognition endows RF-based novel view synthesis with semantic knowledge. As the semantic label embodies an abstract understanding and generalization of the object, combining the recognition task provides guidance to RF-based novel view synthesis learning, consequently enhancing model efficiency.
The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware. In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
The processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 450 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions. Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 400 is transformed during execution of the code 413, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 425, and any registers (not shown) modified by the execution logic 450.
Although not illustrated in
Referring now to
The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in
As shown in
Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.
While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, micro architectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package.
The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. As shown in
The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively. As shown in
In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
As shown in
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of
Example 1 includes a computing system comprising a network controller, a processor coupled to the network controller, and a memory coupled to the processor, the memory including a set of instructions, which when executed by the processor, cause the processor to encode, via an aggregator encoder, multi-view visual data into latent features, decode, via a rendering decoder, the latent features into one or more novel target views different from views of the multi-view visual data, and decode, via a label decoder, the latent features into an object label.
Example 2 includes the computing system of Example 1, wherein the operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time.
Example 3 includes the computing system of any one of Examples 1 to 2, wherein the instructions, when executed, further cause the processor to adjust, via a semantic understanding module, a distance between the latent features and semantic features, and wherein the operation to encode, via the aggregator encoder, the multi-view visual data into latent features is based on the adjusted distance between the latent features and the semantic features.
Example 4 includes the computing system of any one of Examples 1 to 3, wherein the operation to encode, via the aggregator encoder, the multi-view visual data into the latent features further comprises operations to perform, via the aggregator encoder, semantic object recognition operations based on radiance field view synthesis operations, and perform, via the aggregator encoder, radiance field view synthesis operations based on semantic object recognition operations.
Example 5 includes the computing system of any one of Examples 1 to 4, wherein the instructions, when executed, further cause the processor to aggregate, via the aggregator encoder, the multi-view visual data into a coordinate-aligned feature field, wherein multi-view visual data comprises a plurality of red-green-blue images and a plurality of corresponding camera projection matrices, and wherein multi-view visual data comprises views of a plurality of different objects received together by the aggregator encoder.
Example 6 includes the computing system of any one of Examples 1 to 5, wherein the operation to decode, via the rendering decoder, the latent features into one or more novel target views further comprises operations to obtain, via the rendering decoder, a rendered color of a given camera ray as a point-wise color, and map, via the rendering decoder, one or more token features to the pointwise color, wherein the operation to decode, via the label decoder, the latent features into the object label comprises operations to non-linearly map the latent features into a plurality of object categories.
Example 7 includes the computing system of any one of Examples 1 to 6, wherein the latent features are learned through the joint task of integrating 3D object recognition and radiance field view synthesis to incorporate semantic information from 3D object recognition to aid radiance field view synthesis rendering and incorporate radiance field information from radiance field view synthesis rendering of a 3D scene to enhance the 3D object recognition.
Example 8 includes at least one computer readable storage medium comprising a set of instructions, which when executed by a computing system, cause the computing system to encode, via an aggregator encoder, multi-view visual data into latent features, decode, via a rendering decoder, the latent features into one or more novel target views different from views of the multi-view visual data, and decode, via a label decoder, the latent features into an object label.
Example 9 includes the at least one computer readable storage medium of Example 8, wherein the operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time.
Example 10 includes the at least one computer readable storage medium of any one of Examples 8 to 9, wherein the instructions, when executed, further cause the computing system to adjust, via a semantic understanding module, a distance between the latent features and semantic features, and wherein the operation to encode, via the aggregator encoder, the multi-view visual data into latent features is based on the adjusted distance between the latent features and the semantic features.
Example 11 includes the at least one computer readable storage medium of any one of Examples 8 to 10, wherein the operation to encode, via the aggregator encoder, the multi-view visual data into the latent features further comprises operations to perform, via the aggregator encoder, semantic object recognition operations based on radiance field view synthesis operations, and perform, via the aggregator encoder, radiance field view synthesis operations based on semantic object recognition operations.
Example 12 includes the at least one computer readable storage medium of any one of Examples 8 to 11, wherein the instructions, when executed, further cause the computing system to aggregate, via the aggregator encoder, the multi-view visual data into a coordinate-aligned feature field, wherein multi-view visual data comprises a plurality of red-green-blue images and a plurality of corresponding camera projection matrices, and wherein multi-view visual data comprises views of a plurality of different objects received together by the aggregator encoder.
Example 13 includes the at least one computer readable storage medium of any one of Examples 8 to 12, wherein the operation to decode, via the rendering decoder, the latent features into one or more novel target views further comprises operations to obtain, via the rendering decoder, a rendered color of a given camera ray as a point-wise color, and map, via the rendering decoder, one or more token features to the pointwise color.
Example 14 includes the at least one computer readable storage medium of any one of Examples 8 to 13, wherein the operation to decode, via the label decoder, the latent features into the object label comprises operations to non-linearly map the latent features into a plurality of object categories.
Example 15 includes a method comprising encoding, via an aggregator encoder, multi-view visual data into latent features, decoding, via a rendering decoder, the latent features into one or more novel target views different from views of the multi-view visual data, and decoding, via a label decoder, the latent features into an object label.
Example 16 includes the method of Example 15, wherein the operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time.
Example 17 includes the method of any one of Examples 15 to 16, further comprising adjusting, via a semantic understanding module, a distance between the latent features and semantic features, and wherein the operation to encode, via the aggregator encoder, the multi-view visual data into latent features is based on the adjusted distance between the latent features and the semantic features.
Example 18 includes the method of any one of Examples 15 to 17, further comprising aggregating, via the aggregator encoder, the multi-view visual data into a coordinate-aligned feature field, wherein multi-view visual data comprises a plurality of red-green-blue images and a plurality of corresponding camera projection matrices, and wherein multi-view visual data comprises views of a plurality of different objects received together by the aggregator encoder.
Example 19 includes the method of any one of Examples 15 to 18, wherein the operation to decode, via the rendering decoder, the latent features into one or more novel target views further comprises operations to obtain, via the rendering decoder, a rendered color of a given camera ray as a point-wise color, and map, via the rendering decoder, one or more token features to the pointwise color.
Example 20 includes the method of any one of Examples 15 to 19, wherein the operation to decode, via the label decoder, the latent features into the object label comprises operations to non-linearly map the latent features into a plurality of object categories.
Example 21 includes the method of any one of Examples 15 to 20, wherein the operation to encode, via the aggregator encoder, the multi-view visual data into the latent features further comprises operations to perform, via the aggregator encoder, semantic object recognition operations based on radiance field view synthesis operations, and perform, via the aggregator encoder, radiance field view synthesis operations based on semantic object recognition operations.
Example 22 includes an apparatus comprising means for performing the method of any one of Examples 15 to 21.
Technology described herein therefore enables AI (e.g., machine learning) tools to be created for integrating the tasks of 3D object recognition and RF-based novel view synthesis to provide mutual benefits to both domains.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.