The following description relates to a method and apparatus with neural network data processing and/or training.
Training data for a neural network (NN) may correspond to a subset of real data. Accordingly, through training of the NN, an output error for input training data may decrease, but an output error for input real data may increase. This increase may result from “overfitting,” which refers to a phenomenon in which the error for real data increases because the NN is trained excessively on the training data. That is, due to overfitting, an error of the NN may increase.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In one general aspect, a processor-implemented neural network method includes: receiving input data; obtaining a plurality of parameter vectors representing a hierarchical-hyperspherical space comprising a plurality of spheres belonging to a plurality of layers; applying the plurality of parameter vectors to generate a neural network; and generating an inference result by processing the input data using the neural network.
The neural network may include a convolutional neural network (CNN), and the plurality of parameter vectors may include a plurality of filter parameter vectors.
The input data may include image data.
The receiving of the input data may include capturing the input data, and the generating of the inference result may include performing recognition of the input data.
The plurality of layers may correspond to different hierarchical levels in the hierarchical-hyperspherical space.
Centers of spheres, of the plurality of spheres, belonging to a same layer, of the plurality of layers, in the hierarchical-hyperspherical space may be determined based on a center of a sphere belonging to an upper layer of the same layer.
A radius of a sphere, of the plurality of spheres, belonging to a predetermined layer, of the plurality of layers, in the hierarchical-hyperspherical space may be less than a radius of a sphere belonging to an upper layer of the predetermined layer.
A center of a sphere, of the plurality of spheres, belonging to a predetermined layer, of the plurality of layers, in the hierarchical-hyperspherical space may be located in a sphere belonging to an upper layer of the predetermined layer.
Spheres belonging to a same layer, of the plurality of layers, in the hierarchical-hyperspherical space may not overlap one another.
A distribution of the plurality of parameter vectors may be greater than a threshold distribution, and the distribution of the plurality of parameter vectors may indicate a degree by which the plurality of parameter vectors may be globally and uniformly distributed in the hierarchical-hyperspherical space.
The distribution of the plurality of parameter vectors may be determined based on a combination of a discrete distance between the plurality of parameter vectors and a continuous distance between the plurality of parameter vectors.
The discrete distance may be determined by quantizing the plurality of parameter vectors and calculating a Hamming distance between the quantized parameter vectors.
The continuous distance may include an angular distance between the plurality of parameter vectors.
Each of the plurality of parameter vectors may include a center vector indicating a center of a corresponding sphere and a surface vector indicating a surface of the corresponding sphere.
The applying of the plurality of parameter vectors to the neural network may include, for each of the plurality of parameter vectors: generating a projection vector based on the center vector and the surface vector; and applying the projection vector to the neural network.
The generating of the inference result by processing the input data using the neural network may include performing hyperspherical convolutions based on the input data and the generated projection vectors.
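As a concrete sketch of the operations above, the following illustrates forming projection vectors w = w_s − w_c from center and surface vectors and sliding them over an input as a minimal one-dimensional stand-in for the hyperspherical convolution; the shapes, filter count, and random data are illustrative assumptions.

```python
import numpy as np

def projection_vectors(centers, surfaces):
    """Form projection vectors w = w_s - w_c, one per filter."""
    return surfaces - centers

def hyperspherical_conv1d(x, centers, surfaces):
    """Slide each projection vector over x and take dot products
    (a minimal stand-in for the hyperspherical convolution step)."""
    w = projection_vectors(centers, surfaces)          # (num_filters, k)
    k = w.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return windows @ w.T                               # (positions, num_filters)

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))    # hypothetical: 4 filters of width 3
surfaces = rng.normal(size=(4, 3))
x = rng.normal(size=16)
out = hyperspherical_conv1d(x, centers, surfaces)
print(out.shape)   # (14, 4)
```

The inference result would then be derived from such feature maps by the rest of the network.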
The input data may be training data, and the method may include: determining a loss term based on a label of the training data and a result of the processing of the training data; determining a regularization term; and training the plurality of parameter vectors based on the loss term and the regularization term.
In another general aspect, a processor-implemented neural network method includes: receiving training data; processing the training data using a neural network; determining a loss term based on a label of the training data and a result of the processing of the training data; determining a regularization term such that a plurality of parameter vectors of the neural network represent a hierarchical-hyperspherical space comprising a plurality of spheres belonging to a plurality of layers; and training the plurality of parameter vectors based on the loss term and the regularization term, to generate an updated neural network.
The neural network may include a convolutional neural network (CNN), the plurality of parameter vectors may include a plurality of filter parameter vectors, and the training data may include image data.
Centers of spheres, of the plurality of spheres, belonging to a same layer, of the plurality of layers, in the hierarchical-hyperspherical space may be determined based on a center of a sphere belonging to an upper layer of the same layer.
The regularization term may be determined based on any one or any combination of: a first constraint condition in which a radius of a sphere, of the plurality of spheres, belonging to a predetermined layer, of the plurality of layers, in the hierarchical-hyperspherical space is less than a radius of a sphere belonging to an upper layer of the predetermined layer; a second constraint condition in which a center of the sphere belonging to the predetermined layer is located in the sphere belonging to the upper layer of the predetermined layer; and a third constraint condition in which spheres belonging to a same layer in the hierarchical-hyperspherical space do not overlap one another.
The regularization term may be determined such that a distribution of the plurality of parameter vectors may be greater than a threshold distribution, and the distribution of the plurality of parameter vectors may indicate a degree by which the plurality of parameter vectors may be globally and uniformly distributed in the hierarchical-hyperspherical space.
The distribution of the plurality of parameter vectors may be determined based on a combination of a discrete distance between the plurality of parameter vectors and a continuous distance between the plurality of parameter vectors.
The discrete distance may be determined by quantizing the plurality of parameter vectors and calculating a Hamming distance between the quantized parameter vectors; and the continuous distance may include an angular distance between the plurality of parameter vectors.
Each of the plurality of parameter vectors may include a center vector indicating a center of a corresponding sphere and a surface vector indicating a surface of the corresponding sphere.
The regularization term may be determined based on any one or any combination of: a first distance term based on a distance between center vectors of spheres, of the plurality of spheres, belonging to a same layer, of the plurality of layers, in the hierarchical spherical space; a second distance term based on a distance between surface vectors of the spheres belonging to the same layer in the hierarchical spherical space; a third distance term based on a distance between center vectors of spheres, of the plurality of spheres, belonging to different layers, of the plurality of layers, in the hierarchical spherical space; and a fourth distance term based on a distance between surface vectors of the spheres belonging to the different layers in the hierarchical spherical space.
A non-transitory computer-readable storage medium may store instructions that, when executed by a processor, configure the processor to perform the method.
In another general aspect, a neural network apparatus may include: a communication interface configured to receive input data; a memory storing a plurality of parameter vectors representing a hierarchical-hyperspherical space comprising a plurality of spheres belonging to a plurality of layers; and a processor configured to apply the plurality of parameter vectors to generate a neural network, and to generate an inference result by processing the input data using the generated neural network.
The apparatus may include an image sensor configured to interact with the communication interface to provide the received input data, wherein the communication interface may be configured to receive the parameter vectors from an external source and store the parameter vectors in the memory.
The apparatus may include instructions that, when executed by the processor, configure the processor to implement the communication interface to receive the input data, and to implement the neural network to generate the inference result.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, unless otherwise described or provided, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order. Also, descriptions of features that are known after an understanding of the disclosure of this application may be omitted for increased clarity and conciseness.
The features described herein may be embodied in different forms, and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.
The following structural or functional descriptions of examples disclosed in the present disclosure are merely intended for the purpose of describing the examples and the examples may be implemented in various forms. The examples are not meant to be limited, but it is intended that various modifications, equivalents, and alternatives are also covered within the scope of the claims.
Although terms of “first” or “second” are used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Rather, these terms are only used to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. Thus, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.
Throughout the specification, when a component is described as being “connected to,” or “coupled to” another component, it may be directly “connected to,” or “coupled to” the other component, or there may be one or more other components intervening therebetween. In contrast, when an element is described as being “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween. Likewise, similar expressions, for example, “between” and “immediately between,” and “adjacent to” and “immediately adjacent to,” are also to be construed in the same way. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items.
As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terminology used herein is for describing various examples only and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms “comprises,” “includes,” and “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.
Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains and based on an understanding of the disclosure of the present application. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the disclosure of the present application and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein. The use of the term “may” herein with respect to an example or embodiment (e.g., as to what an example or embodiment may include or implement) means that at least one example or embodiment exists where such a feature is included or implemented, while all examples are not limited thereto.
Hereinafter, examples will be described in detail with reference to the accompanying drawings, and like reference numerals in the drawings refer to like elements throughout.
To solve the technological problem of overfitting, one or more embodiments of the present disclosure may train a neural network using a regularization technique that advantageously decreases an output error for input real data.
In an example, a group between parameter vectors for samples with the same or sufficiently similar characteristic may be formed and a regularization may be applied to the group. In an example, the samples may include input images and the parameter vectors may include filter parameter vectors (or weight parameter vectors) of a filter (or kernel) of a convolutional neural network (CNN). In this example, a class for defining each group may be referred to as a “super-class.” For each sample of a class, a pair of coarse super-classes and coarse sub-classes and a pair of fine super-classes and fine sub-classes may be defined, to form a layer of a hyperspherical space.
Since it is typically difficult to measure a pairwise distance between high dimensional vectors with a hierarchical structure in the same space, one or more embodiments of the present disclosure may construct another identification space including a space isolated from the original space.
Here, the d-sphere refers to a set of points satisfying S^d = {w ∈ R^(d+1) : ∥w∥ = 1}, for example.
Multiple separated hyperspheres may be constructed using multiple identifying relationships. In an example, a single space may be decomposed into multiple spaces, and redefined in terms of a hierarchical point of view, and accordingly a hierarchical structure may be applied to a regularization of a parameter vector of a hyperspherical space for each of multiple groups. To uniformly distribute parameter vectors on a unit hypersphere, the parameter vectors may be sampled from a Gaussian normal distribution. This is because the Gaussian normal distribution is spherically symmetric. Also, in a Bayesian point of view, a neural network with a Gaussian prior may induce an L2-norm regularization.
Based on the above description, a parameter vector of the neural network for the hyperspherical space may be trained to have a Gaussian prior. A projection vector calculated by a difference arithmetic operation between two parameter vectors in the Gaussian normal distribution may indicate a normal difference distribution.
In a deep neural network, an objective function with a regularization R in addition to a loss L, O(W) = L(x, W) + λR(W), may optimize a parameter tensor W near a minimum loss, arg min_W L(x, W), in which x ∈ R^d denotes an input vector. The parameter tensor may be a multi-dimensional matrix and may include a matrix or a vector, as non-limiting examples.
The term “parameter vector” used herein may be a parameter tensor or a parameter matrix, depending on examples.
W = {W_i : W_i = {w_j}, j = 1, …, c_i, i = 1, …, L} denotes matrices (for example, neuron connective weights or kernels) of parameter vectors, L denotes a number of layers, and λ > 0 is to control a degree of a regularization, for example.
For example, for a classification task, a cross-entropy loss may be used for the loss function L.
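The objective O(W) = L(x, W) + λR(W) can be sketched numerically as follows; the stable log-softmax cross-entropy and the simple L2 placeholder regularizer below are illustrative assumptions standing in for the hierarchical regularization term.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy loss for one sample (numerically stable log-softmax)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def objective(logits, label, W, lam=0.01):
    """O(W) = L(x, W) + lambda * R(W); R here is a placeholder
    sum-of-squares regularizer, not the hierarchical term itself."""
    reg = sum(float(np.sum(w * w)) for w in W)
    return cross_entropy(logits, label) + lam * reg

W = [np.ones((2, 2)), np.ones(3)]          # toy parameter tensors
val = objective(np.array([2.0, 1.0, 0.1]), 0, W, lam=0.1)
```

Minimizing such an objective over W trades off the data loss against the regularizer, with λ controlling the degree of regularization.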
In an example, a regularization may be performed using a new regularization formulation.
w, an element of W at a single layer, denotes a projection vector to transform a given input into an embedding space defined in a Euclidean metric space, x ∈ R^d ↦ w^T x ∈ R, for example.
By defining a unit-length projection w/∥w∥, a new parameter vector ŵ may be defined on the d-sphere S^d = {ŵ : ∥ŵ∥ = 1}, in which ∥·∥ denotes the l2-norm and the center is zero. In other words, a projection vector ŵ may be defined by a center vector w_c indicating a center of a hypersphere and a surface vector w_s through the arithmetic operation ŵ := w_s − w_c, for example.
In an example, a d-sphere S^d = {w_s − w_c : ∥w_s − w_c∥ = 1} may be defined by the center vector w_c and the surface vector w_s. Hereinafter, for simplicity of notation, w is used instead of ŵ.
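The definition above can be sketched directly; the normalization step below realizes the unit-length projection so that the resulting vector lies on the unit sphere, with toy two-dimensional vectors assumed for illustration.

```python
import numpy as np

def projection_on_sphere(w_s, w_c):
    """Projection vector w_hat := (w_s - w_c), normalized so that
    ||w_hat|| = 1, i.e., a point on the unit d-sphere."""
    w = w_s - w_c
    return w / np.linalg.norm(w)

w_hat = projection_on_sphere(np.array([3.0, 4.0]), np.array([0.0, 0.0]))
print(np.linalg.norm(w_hat))   # 1.0
```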
In an example, when a radius is regarded to be “1”, a parameter vector has a radius r > 0. The radius of the global area converges to r_0/(1 − δ) when the level l goes to infinity, where the sum of the radius series is taken over levels, r_0 denotes an initial radius of a sphere, and the constant δ denotes a ratio between radii of which the absolute value is less than “1”.
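Assuming the radius series is geometric with ratio δ (consistent with a constant ratio between radii and |δ| < 1), the convergence of the accumulated radius can be checked numerically:

```python
# Radii shrink by a constant ratio delta per level; the accumulated
# extent converges to r0 / (1 - delta) when |delta| < 1.
def accumulated_radius(r0, delta, levels):
    return sum(r0 * delta ** l for l in range(levels))

r0, delta = 1.0, 0.5
approx = accumulated_radius(r0, delta, 50)
limit = r0 / (1 - delta)
print(approx, limit)   # both ~2.0
```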
In an example, a parameter vector may be trained such that a diversity increases using a parameter vector such as a projection matrix or a projection vector as a transformation of an input vector. For example, a diversity of parameter vectors may be increased by a regularization through a globally uniform distribution between the parameter vectors. To this end, semantics between parameter vectors may be applied through a hierarchical space, and a distribution between high-dimensional parameter vectors may be diversified based on a distance metric in the same semantic space (for example, spheres belonging to the same layer in a single group) and a different semantic space (for example, spheres belonging to different layers).
In an example, w⃗″ may exist in multiples of δ. The projection vector w⃗, the surface vector w⃗_s, and the center vector w⃗_c may respectively correspond to the above-described vectors ŵ, w_s, and w_c, for example.
For example, a hierarchical structure of a hypersphere may include a levelwise structure with a notation (l) and a groupwise structure with a notation g.
Levelwise Structure
Parameter vectors may be defined by the levelwise notation (l) as shown in Equation 1 below, for example.

w^(l) := w_s^(l) − w_c^(l)   (Equation 1)

In Equation 1, the parameter vectors are defined for the l-th level of the d-sphere.
For example, hierarchical parameter vectors are defined in a higher-dimensional space.
In a levelwise setting, w_s^(l) and w_c^(l) may be represented based on a center vector calculated at the previous level, w_c^(l−1) + Δw⃗^(l) → w_c^(l), in which w_c^(l−1) denotes an accumulated center vector and Δw⃗^(l) denotes a parameter vector newly connected from w_c^(l−1) to w_c^(l).
By denoting Δw⃗^(l) as w^(l,l−1), a center vector at the l-th level may be defined as w_c^(l) := w_c^(l,l−1) + w_c^(l−1), and a surface vector may be defined as w_s^(l) := w_s^(l,l−1) + w_c^(l−1).
Both a center vector and a surface vector at a current level may be based on a center vector at a previous level. However, since not all samples include a child sample, it may be more advantageous to perform branching from a representative parameter or a center parameter rather than from an individual projection vector.
A level may correspond to each layer in a hierarchical structure. In the following description, the terms “level” and “layer” are understood to have the same meaning.
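The levelwise branching above can be sketched as follows; the toy vectors and the purely additive updates are illustrative assumptions consistent with w_c(l) = w_c(l,l−1) + w_c(l−1) and w_s(l) = w_s(l,l−1) + w_c(l−1).

```python
import numpy as np

def next_level(w_c_prev, delta_w_c, delta_w_s):
    """Levelwise update: both the new center and surface vectors branch
    from the previous level's center vector:
        w_c(l) = w_c(l,l-1) + w_c(l-1)
        w_s(l) = w_s(l,l-1) + w_c(l-1)"""
    return w_c_prev + delta_w_c, w_c_prev + delta_w_s

w_c0 = np.zeros(3)                                   # level-0 center
w_c1, w_s1 = next_level(w_c0,
                        np.array([0.1, 0.0, 0.0]),   # center offset
                        np.array([0.1, 0.5, 0.0]))   # surface offset
w1 = w_s1 - w_c1   # projection vector at level 1, as in Equation 2
```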
Equation 1 described above may be expressed as Equation 2 below, for example.

w^(l) = w_s^(l,l−1) − w_c^(l,l−1)   (Equation 2)
For example, the notation (l,l−1) denotes a vector connected from the center vector at the (l−1)-th level to the l-th level.
Groupwise Structure
By a group notation g_k, the center vector in Equation 1 may be expressed as w_{c,g_k}^(l) for the group g_k at the l-th level. G^(l) denotes the group set at the l-th level, and |·| denotes cardinality. A group g^(l) at the current level may be adjusted within a group of the previous level, in which g^(l) ⊆ G^(l−1).
With a groupwise relationship between levels, an adjacency indication may be calculated. Depending on examples, the adjacency indication may be replaced with a probability model. Thus, a projection vector at the l-th level may be determined as w_{i,g_k}^(l) := w_{s,i,g_k}^(l) − w_{c,g_k}^(l), in which i = 1, …, |g_k|.
Also, {w_{s,i,g_k}^(l)} denotes the surface vectors of the group. A representative vector of the group g_k at the l-th level is w_{c,g_k}^(l). When the representative vector of the group g_k is determined by a predetermined vector and the center vector at the previous level, an adjustment factor ϵ may be used in defining w_{c,g_k}^(l).
In an example, parameter vectors for each layer may be defined based on a center vector in a spherical space, which may be suitable for training for each group. For example, a regularization may be performed by defining a center and/or a radius of each of spheres included in a hierarchical-hyperspherical space and by assigning a constraint condition to a space for each group.
A regularization term of a hierarchical parameter vector defined above is defined below.
A set of parameter vectors {W_{s,g_k}^(l)} may be regularized as shown in Equation 3 below, for example.

R_H := Σ_l ( λ_l R_l(W_{s,g_k}^(l)) + C_l )   (Equation 3)

In Equation 3, R_l operates on an individual sphere, λ_l > 0, and C_l denotes a constraint term to apply geometry-aware constraints to a sphere. For example, the constraint term C_l may correspond to a constraint on a relationship between spheres, indicating how the relationship between spheres is to be formed.
Equation 3 may be used for a regularization between an upper layer and a lower layer.
R_l includes two regularization terms as shown in Equation 4 below: a term R_{l,p} for projection vectors in the same group g_k, and a term R_{l,c} for center vectors across groups at the same level, for example.

R_l(W_{s,g_k}^(l)) := R_{l,p} + R_{l,c}   (Equation 4)
In Equation 4, R_{l,p} is a regularization term of a distance between projection vectors and may be expressed as shown in Equation 5 below, for example. Also, R_{l,c} is a regularization term of a distance between center vectors and may be expressed as shown in Equation 6 below, for example.
In Equations 5 and 6, w_{g_k} denotes a parameter vector of the group g_k. For example, when a mini-batch is given, the regularization term may be computed over the parameter vectors of the mini-batch.
In addition to the above hierarchical regularization of Equation 3, an orthogonality promoting term may be applied to a center vector. In this term, ∥·∥_F denotes a Frobenius norm, and λ_o > 0.
For example, a magnitude (l2-norm) minimization and energy minimization may be applied to parameter vectors that do not have hierarchical information. In this example, the magnitude minimization may be performed by arg minw λfΣk∥wk∥ in which wk∈W and λf>0. The energy minimization may be performed by arg minw Σi≠jλcd(wi,wj) in which λc>0. The energy minimization may be referred to as a “pairwise distance minimization”.
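The two non-hierarchical terms above can be sketched as follows; the inverse-distance choice for d(w_i, w_j) in the energy term is an assumption, in the spirit of the Thomson-style s-energy mentioned later in the text.

```python
import numpy as np

def magnitude_term(W, lam_f=0.1):
    """Magnitude (l2-norm) minimization: lam_f * sum_k ||w_k||."""
    return lam_f * sum(float(np.linalg.norm(w)) for w in W)

def energy_term(W, lam_c=0.1):
    """Pairwise "energy" over all i != j pairs; d(w_i, w_j) is taken
    here as an inverse-distance energy (an illustrative assumption)."""
    total = 0.0
    for i in range(len(W)):
        for j in range(len(W)):
            if i != j:
                total += 1.0 / (np.linalg.norm(W[i] - W[j]) + 1e-8)
    return lam_c * total

W = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
print(magnitude_term(W), energy_term(W))
```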
The constraint term C_l on the right side of Equation 3 helps in constructing geometry-aware relational parameter vectors between different spheres.
Multiple constraint conditions are defined as C_l := Σ_k λ_k C_{l,k}, in which C_{l,k} denotes a k-th constraint condition between parameter vectors at the l-th level and the (l−1)-th level, and λ_k > 0 denotes a Lagrange multiplier.
For example, three constraint conditions may be applied in a geometric point of view. The three constraint conditions are defined below.
1. Constraint condition C1 describes that a radius of an l-th inner sphere is less than a radius of an (l−1)-th outer sphere, as shown in the following equation:

r^(l−1) − r^(l) ≥ 0 ⇒ ∥w^(l−1)∥ − ∥w^(l)∥ = ∥w_s^(l−1) − w_c^(l−1)∥ − ∥w_s^(l) − w_c^(l)∥ ≥ 0.

2. Constraint condition C2 describes that a center of an l-th inner sphere is located in an (l−1)-th outer sphere, as shown in the following equation:

r^(l−1) − (∥w_c^(l,l−1)∥ + r^(l)) ≥ 0 ⇒ r^(l−1) − (∥w_c^(l−1,0) − w_c^(l,0)∥ + r^(l)) = ∥w_s^(l−1,0) − w_c^(l−1)∥ − (∥w_c^(l−1) − w_c^(l)∥ + ∥w_s^(l) − w_c^(l)∥) ≥ 0.

3. Constraint condition C3 describes that a margin between spheres belonging to the same level is greater than zero.
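Constraints C1 and C2 can be checked numerically as a sketch; the radii are derived from center/surface vector pairs as in the text, and the toy sphere coordinates are illustrative assumptions.

```python
import numpy as np

def radius(w_s, w_c):
    """Radius of a sphere defined by its surface and center vectors."""
    return float(np.linalg.norm(w_s - w_c))

def check_constraints(w_c_out, w_s_out, w_c_in, w_s_in):
    """C1: inner radius not larger than outer radius.
       C2: inner sphere (center plus radius) contained in outer sphere."""
    r_out, r_in = radius(w_s_out, w_c_out), radius(w_s_in, w_c_in)
    c1 = bool(r_out - r_in >= 0)
    c2 = bool(r_out - (np.linalg.norm(w_c_out - w_c_in) + r_in) >= 0)
    return c1, c2

c_out, s_out = np.zeros(2), np.array([2.0, 0.0])         # outer: r = 2
c_in, s_in = np.array([0.5, 0.0]), np.array([1.0, 0.0])  # inner: r = 0.5
print(check_constraints(c_out, s_out, c_in, s_in))       # (True, True)
```

Moving the inner center outside the outer sphere violates C2 while C1 may still hold.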
A discrete product metric may be suitable for the above-described groupwise definition, and projection points from parameter vectors formed in a discrete metric space may be isolated from each other.
The discrete distance may be determined such that a pair of vectors with the same angular distance are nevertheless distributed apart. Because the goal is to maximize the distance between parameter vectors, maximizing the discrete distance may distribute the parameter vectors more diversely.
When a sign function is used in a Euclidean metric space, a discrete distance metric for vectors w_i and w_j may be defined as shown in Equation 7 below, for example.
In Equation 7, D_h denotes a normalized version of a Hamming distance. For a ternary discretization, {−1, 0, 1} may be used for sign(x).
For example, to regard the discrete distance as an angular distance within [0, 1], a normalized distance D_h^01 may be defined. An angular distance based on a product is expressed as θ_{D_h}, and for the angular distance θ_{D_h}, the normalization D_h^01 may be applied, so that 0 ≤ D_h^01 ≤ 1 is satisfied.
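A minimal sketch of this quantize-then-compare metric follows, assuming binary sign quantization and taking the fraction of mismatching signs as the Hamming distance normalized to [0, 1]; the sample vectors are illustrative.

```python
import numpy as np

def discrete_distance(w_i, w_j):
    """Quantize with sign() and return the Hamming distance normalized
    to [0, 1] (a ternary variant would quantize to {-1, 0, 1})."""
    q_i, q_j = np.sign(w_i), np.sign(w_j)
    return float(np.mean(q_i != q_j))

w_i = np.array([0.3, -1.2, 0.7, -0.1])
w_j = np.array([0.5, 0.9, -0.4, -0.2])
d = discrete_distance(w_i, w_j)
print(d)   # 0.5  (two of the four signs differ)
```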
The discrete distance may be limited to approximate a model distribution.
A discrete distance metric may be merged with a continuous angular distance metric into a single metric.
For example, a definition of Pythagorean means including an arithmetic mean (AM), a geometric mean (GM) and a harmonic mean (HM) may be used to merge the discrete distance metric with the continuous angular distance metric.
Pythagorean means using the above-described angle pair may be defined as shown in Equation 8 below, for example.
In an angular distance using the pair {θ_{D_h}, θ_c}, a cosine-based term may be adopted to maximize an angle in an optimization formulation as a form of minimization, instead of (·)^−s. In 0 ≤ θ ≤ 1, an angle and its cosine value show an inverse relationship, for example, 0 ≤ θ ≤ 1 → 1 ≥ cos θπ ≥ −1. Here, s = 1, 2, … is used as in a Thomson problem that utilizes s-energy.
A cosine similarity of the above angles may be defined as shown in Equation 9 below, for example.
In Equation 9, the cosine similarity functions may be normalized to have a distance value within [0, 1].
Pythagorean means of a cosine similarity may be calculated as shown in Equation 10 below, for example.
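Absent the typeset equations, a plain reading of the Pythagorean means over a pair of distance values (for example, the discrete and continuous angular distances, both in [0, 1]) can be sketched as:

```python
import math

def pythagorean_means(a, b):
    """Arithmetic, geometric and harmonic means of two non-negative
    distance values -- one way the discrete and continuous angular
    distances might be merged into a single metric."""
    am = (a + b) / 2
    gm = math.sqrt(a * b)
    hm = 2 * a * b / (a + b) if a + b > 0 else 0.0
    return am, gm, hm

am, gm, hm = pythagorean_means(0.5, 0.125)
# HM <= GM <= AM always holds for non-negative inputs
```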
Metrics defined in Equations 8, 9, and 10 satisfy the three metric conditions, that is, non-negativity, symmetry, and the triangle inequality.
A distance between two points using the above-described metrics may be bounded, because a hypersphere is a compact manifold.
Since a sign function is not differentiable at a value of “0”, a backpropagation function instead of the sign function may be used. For a sign function in a discrete metric, a straight-through estimator (STE) may be adopted in a backward path of a neural network.
A derivative of the sign function is substituted with 1_{|w|≤1}, which is known as a saturated STE, in the backward path.
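The saturated STE described above can be sketched without any autograd framework; the backward pass simply passes the incoming gradient through where |w| ≤ 1 and zeroes it elsewhere, since sign itself has a zero or undefined derivative.

```python
import numpy as np

def sign_forward(w):
    """Forward pass: hard sign quantization."""
    return np.sign(w)

def sign_backward(grad_out, w):
    """Saturated straight-through estimator: pass the gradient through
    where |w| <= 1, zero elsewhere."""
    return grad_out * (np.abs(w) <= 1.0)

w = np.array([-2.0, -0.5, 0.3, 1.5])
g = sign_backward(np.ones_like(w), w)
print(g)   # [0. 1. 1. 0.]
```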
A derivative of cos^−1(x) is not defined at x = ±1, and accordingly x ∈ [−0.99, 0.99] may be obtained by applying clamping to the cosine function. Also, x = cos(θπ), 0 ≤ θ ≤ 1 may be satisfied.
When a dimensionality of a vector increases, a probability of increasing a sparsity of the vector may also increase. A squared Euclidean distance may be expressed as ∥x − y∥² = ∥x∥² + ∥y∥² − 2x·y. When two parameter vectors are similar but sparse, for example, x·y ≈ 0, there is a technological problem in that it may be difficult to reflect a similarity between the two parameter vectors due to the magnitude values ∥x∥² + ∥y∥² of the two parameter vectors.
Since a cosine distance is calculated after a parameter vector is projected onto a unit sphere (∥x − y∥² = 2 − 2x·y), a noise effect may decrease. However, since the search space increases when searching for parameter vectors with an even distribution in a spherical space, there is a technological problem in that an optimization may not be achieved. Thus, one or more embodiments of the present disclosure may solve this technological problem and achieve optimization by using a distance space obtained by reducing the search space.
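The magnitude-domination effect described above can be demonstrated with two sparse high-dimensional vectors whose supports are disjoint (so x·y is exactly zero); the dimensionality and supports below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1000
# Two sparse high-dimensional vectors with disjoint supports,
# so their dot product is 0 regardless of the nonzero values.
x = np.zeros(d); x[:10] = rng.normal(size=10)
y = np.zeros(d); y[500:510] = rng.normal(size=10)

sq_euclidean = np.sum((x - y) ** 2)      # dominated by ||x||^2 + ||y||^2
x_u, y_u = x / np.linalg.norm(x), y / np.linalg.norm(y)
sq_on_sphere = np.sum((x_u - y_u) ** 2)  # 2 - 2 x.y  ->  exactly 2 here
print(round(float(sq_on_sphere), 3))
```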
In one or more embodiments of the present disclosure, a continuous value in a Euclidean space may be mapped to, for example, a binary or ternary discrete value, and thus a uniform parameter vector distribution may be stably trained.
In one or more embodiments of the present disclosure, when a parameter vector is searched for in a discretized space as shown in
The encoder 410 may extract a feature vector of input data.
The coarse segmenter 420 may output a coarse label of the feature vector through a loss function L and a regularization function R. The coarse segmenter 420 may perform a regularization between an upper level and a lower level by Equation 3 described above, and the coarse label may correspond to the above-described center vector, for example.
The fine classifier 430 may output a fine label of the feature vector through the loss function L and the regularization function R. The fine classifier 430 may perform a regularization between same levels by Equation 4 described above, and the fine label may correspond to the above-described surface vector, for example.
The relationship regularizer 440 may perform a regularization by a relationship between the coarse label and the fine label. A regularization result by a relationship R(c,f) of the relationship regularizer 440 may correspond to C_l of Equation 3, that is, a constraint on a relationship between spheres which indicates how the relationship between spheres is to be formed.
For example, a regularization may be expressed as R = Rf + R(c,f) + Rc, which corresponds to Equations 3 and 4, for example.
A label at every layer in a hierarchical structure may be trained by the relationship R(c,f) between the coarse label and the fine label, and a regularization at the last layer may be performed by Rf.
A regularization may be performed by maximizing a distance between parameter vectors, or by minimizing energy between parameter vectors.
A regularization reflecting hierarchical information may also be performed by a regularization of a representative parameter vector for each group reflecting statistical characteristics (for example, a mean) of parameter vectors for each group.
A label of R(c,f) representing a relationship may be obtained through clustering of self-supervised learning or semi-supervised learning. A hierarchical parameter vector (obtained by combining a coarse parameter vector corresponding to the coarse label and a fine parameter vector corresponding to the fine label) may be applied to a neural network and input data may be processed using the neural network to which the hierarchical parameter vector is applied.
The input image 510 may be represented by the coarse parameter vector 520 and the fine parameter vector 530 through a hierarchical-hyperspherical space that includes a plurality of spheres belonging to different layers. The hierarchical parameter vector 540 (obtained by combining the coarse parameter vector 520 and the fine parameter vector 530) may be applied to a neural network, and input data (e.g., the input image 510) may be processed using the neural network, so that the feature 550 corresponding to the input image 510 is output. For example, the feature 550 may be generated by performing a convolution operation based on the input image 510 (or a feature vector generated based on the input image 510), using the neural network to which the hierarchical parameter vector 540 is applied.
The generator may form, or represent, a multilayer neural network. Also, a recognizer or a generator in a layered representation may be generated by a combination of the above-described coarse parameter vector and fine parameter vector.
The generator, configured to generate an image, may be utilized through generation of the layered noise vector.
In operation 720, the data processing apparatus may acquire or obtain (e.g., from a memory) a plurality of parameter vectors representing a hierarchical-hyperspherical space that includes a plurality of spheres belonging to different layers. The plurality of parameter vectors may correspond to, for example, the above-described projection vector w or a projection parameter vector. Each of the plurality of parameter vectors may include a center vector wc indicating a center of a corresponding sphere and a surface vector ws indicating a surface of the sphere.
Centers of spheres belonging to the same layer in the hierarchical-hyperspherical space may be determined based on, for example, a center of a sphere belonging to an upper layer of the same layer. For example, both a center vector and a surface vector at a current level may be based on a center vector at a previous level. The hierarchical-hyperspherical space may satisfy constraint conditions described below. A radius of a sphere belonging to a predetermined layer in the hierarchical-hyperspherical space may be less than a radius of a sphere belonging to an upper layer of the predetermined layer. A center of a sphere belonging to a predetermined layer may be located in the sphere belonging to an upper layer of the predetermined layer, and spheres belonging to the same layer in the hierarchical-hyperspherical space may not overlap each other.
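The three constraint conditions above (decreasing radii, child centers inside the parent sphere, and non-overlapping same-layer spheres) may be checked as in the following minimal sketch; the function names and the strict-inequality conventions are illustrative assumptions.

```python
import numpy as np

def satisfies_hierarchy(parent_center, parent_radius, child_centers, child_radii):
    """Check the three hierarchy constraints for one parent sphere and
    its same-layer children (an illustrative sketch)."""
    child_centers = np.asarray(child_centers, dtype=float)
    child_radii = np.asarray(child_radii, dtype=float)
    # 1) each child radius is less than the parent (upper-layer) radius
    if not np.all(child_radii < parent_radius):
        return False
    # 2) each child center is located inside the parent sphere
    d_parent = np.linalg.norm(child_centers - parent_center, axis=1)
    if not np.all(d_parent < parent_radius):
        return False
    # 3) spheres belonging to the same layer do not overlap:
    #    center distance must be at least the sum of the two radii
    n = len(child_radii)
    for i in range(n):
        for j in range(i + 1, n):
            gap = np.linalg.norm(child_centers[i] - child_centers[j])
            if gap < child_radii[i] + child_radii[j]:
                return False
    return True
```

For example, two child spheres of radius 0.2 centered at (±0.5, 0) inside a unit parent sphere satisfy all three conditions, whereas enlarging their radii to 0.6 violates the non-overlap condition.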
A distribution of the plurality of parameter vectors, which indicates a degree by which the plurality of parameter vectors are globally and uniformly distributed in the hierarchical-hyperspherical space, may be greater than a threshold distribution. The distribution may be determined based on, for example, a combination of a discrete distance between the plurality of parameter vectors and a continuous distance between the plurality of parameter vectors. The discrete distance may be determined by quantizing the plurality of parameter vectors and calculating a Hamming distance between the quantized parameter vectors. The discrete distance may correspond to, for example, the above-described discrete distance Dh.
The continuous distance may include an angular distance between the plurality of parameter vectors. The continuous distance may correspond to, for example, the above-described angular distance Da.
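A distribution measure combining the discrete (Hamming) distance and the continuous (angular) distance may be sketched as follows. Sign-bit quantization and the convex weight `alpha` are hypothetical choices; the description states only that the two distances are combined.

```python
import numpy as np

def discrete_distance(vectors):
    # Dh sketch: quantize each coordinate to its sign bit and average
    # the pairwise (normalized) Hamming distances.
    bits = (np.asarray(vectors, dtype=float) >= 0).astype(int)
    n = len(bits)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([np.mean(bits[i] != bits[j]) for i, j in pairs]))

def angular_distance(vectors):
    # Da sketch: average pairwise angle between the normalized vectors.
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    iu = np.triu_indices(len(v), k=1)
    return float(np.mean(np.arccos(cos[iu])))

def distribution_measure(vectors, alpha=0.5):
    # Combined measure: larger values indicate a more global, uniform
    # spread of the parameter vectors.
    return alpha * discrete_distance(vectors) + (1 - alpha) * angular_distance(vectors)
```

Nearly identical vectors score close to zero, while antipodal vectors score highly on both components, so thresholding this measure rewards a globally uniform spread.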
In operation 730, the data processing apparatus may apply the plurality of parameter vectors to generate the neural network. The neural network may include, for example, a convolutional neural network (CNN), and the plurality of parameter vectors may include a plurality of filter parameter vectors. For example, the data processing apparatus may generate a projection vector based on a center vector and a surface vector corresponding to each of the plurality of parameter vectors, and may apply the projection vector to generate the neural network. In this example, the center vector and the surface vector may correspond to a center vector and a surface vector of a sphere belonging to a level or layer of one of the plurality of spheres included in the hierarchical-hyperspherical space. For example, when a current level is l, a center vector indicating a center of a sphere with the level l may correspond to the above-described wc(l), and a surface vector indicating a surface of the sphere with the level l may correspond to the above-described ws(l).
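Generating a projection vector from a center vector wc(l) and a surface vector ws(l) and applying it as a filter may be sketched as below. The additive combination and the dot-product filter are assumptions for illustration; the description states only that the projection vector is generated based on the two vectors.

```python
import numpy as np

def projection_vector(w_c, w_s):
    # Assumed combination rule: the projection vector is the sum of the
    # center vector w_c(l) and the surface vector w_s(l).
    return np.asarray(w_c, dtype=float) + np.asarray(w_s, dtype=float)

def apply_filter(feature, w_c, w_s):
    # Minimal stand-in for applying a filter parameter vector in a CNN
    # layer: a dot product of the (flattened) feature with the filter.
    w = projection_vector(w_c, w_s)
    return float(np.dot(np.asarray(feature, dtype=float), w))
```

In a real CNN, the projection vector would parameterize a convolution filter rather than a single dot product; the dot product here shows only the per-position computation.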
In operation 740, the data processing apparatus may process the input data based on the generated neural network to which the plurality of parameter vectors are applied in operation 730. In an example, the processing of the input data using the generated neural network may include performing recognition of the input data.
In operation 820, the training apparatus may process the training data based on a neural network. The neural network may include, for example, a CNN, and a plurality of parameter vectors of the neural network may include a plurality of filter parameter vectors. Each of the plurality of parameter vectors may include a center vector indicating a center of a corresponding sphere and a surface vector indicating a surface of the sphere.
In operation 830, the training apparatus may determine a loss term based on a label of the training data and a result obtained by processing the training data.
In operation 840, the training apparatus may determine a regularization term such that the parameter vectors of the neural network represent a hierarchical-hyperspherical space. The hierarchical-hyperspherical space may include a plurality of spheres belonging to different layers. Also, centers of spheres belonging to the same layer in the hierarchical-hyperspherical space may be determined based on a center of a sphere belonging to an upper layer of the same layer. In operation 840, the regularization term may be determined based on any one or any combination of a first constraint condition in which a radius of a sphere belonging to a predetermined layer in the hierarchical-hyperspherical space is less than a radius of a sphere belonging to an upper layer of the predetermined layer, a second constraint condition in which a center of a sphere belonging to a predetermined layer is located in a sphere belonging to an upper layer of the predetermined layer, and a third constraint condition in which spheres belonging to the same layer in the hierarchical-hyperspherical space do not overlap each other.
For example, the regularization term may be determined such that a distribution of the plurality of parameter vectors is greater than a threshold distribution. The distribution may indicate a degree by which the plurality of parameter vectors are globally and uniformly distributed in the hierarchical-hyperspherical space, that is, it indicates a degree of the regularization. The distribution may be determined based on, for example, a combination of a discrete distance between the plurality of parameter vectors and a continuous distance between the plurality of parameter vectors. The discrete distance may be determined by quantizing the plurality of parameter vectors and calculating a Hamming distance between the quantized parameter vectors. The continuous distance may include an angular distance between the plurality of parameter vectors.
Also, the regularization term may be determined based on, for example, any one or any combination of a first distance term based on a distance between center vectors of spheres belonging to the same layer in the hierarchical-hyperspherical space, a second distance term based on a distance between surface vectors of spheres belonging to the same layer in the hierarchical-hyperspherical space, a third distance term based on a distance between center vectors of spheres belonging to different layers in the hierarchical-hyperspherical space, and a fourth distance term based on a distance between surface vectors of spheres belonging to different layers in the hierarchical-hyperspherical space.
In operation 850, the training apparatus may train the parameter vectors based on the loss term determined in operation 830 and the regularization term determined in operation 840.
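Operations 830 to 850 can be sketched as descending the sum of a loss term and a regularization term. The squared-error loss, the cosine-based spread regularizer, the regularization weight, and the finite-difference gradient are all illustrative assumptions, not the method prescribed above.

```python
import numpy as np

def total_objective(params, x, y, reg_weight=0.1):
    # Loss term (operation 830): squared error between the processed
    # result (params @ x) and the label y.
    loss = float(((params @ x - y) ** 2).sum())
    # Regularization term (operation 840): pairwise-cosine spread
    # penalty on the parameter vectors (illustrative choice).
    v = params / np.linalg.norm(params, axis=1, keepdims=True)
    cos = v @ v.T
    reg = float((cos.sum() - len(v)) / 2.0)
    return loss + reg_weight * reg

def train(params, x, y, steps=50, lr=0.01, eps=1e-5):
    # Operation 850: train the parameter vectors by gradient descent on
    # loss + regularization, using a finite-difference gradient so the
    # sketch stays dependency-free.
    params = np.asarray(params, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(params)
        for idx in np.ndindex(*params.shape):
            p = params.copy(); p[idx] += eps
            m = params.copy(); m[idx] -= eps
            grad[idx] = (total_objective(p, x, y) - total_objective(m, x, y)) / (2 * eps)
        params -= lr * grad
    return params
```

A practical implementation would use automatic differentiation and the hierarchy-aware regularizer described above; the sketch only shows the loss-plus-regularization training structure.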
The communication interface 910 may receive input data. The communication interface 910 may receive the input data from the image sensor 940. The image sensor 940 may acquire or capture the input data when the input data is image data. The image sensor 940 may be an optical sensor such as a camera. The communication interface 910 may acquire a plurality of parameter vectors representing a hierarchical-hyperspherical space that includes a plurality of spheres belonging to different layers.
The processor 920 may apply the plurality of parameter vectors to a neural network and process the input data based on the neural network.
Also, the processor 920 may perform at least one of the methods described above.
The processor 920 may execute a program and control the data processing apparatus 900. Codes of the program executed by the processor 920 may be stored in the memory 930.
The memory 930 may store a variety of information generated in a processing process of the above-described processor 920. Also, the memory 930 may store a variety of data and programs. The memory 930 may include, for example, a volatile memory or a non-volatile memory. The memory 930 may include a high-capacity storage medium such as a hard disk to store a variety of data.
The apparatuses, units, modules, devices, encoders, coarse segmenters, fine classifiers, relationship regularizers, optimizers, generators, data processing apparatuses, communication buses, communication interfaces, processors, memories, image sensors, encoder 410, coarse segmenter 420, fine classifier 430, relationship regularizer 440, optimizer 450, generator, data processing apparatus 900, communication bus 905, communication interface 910, processor 920, memory 930, image sensor 940, and other components described herein are implemented by hardware components configured to perform the operations described in this application that are performed by the hardware components.
The methods illustrated herein that perform the operations described in this application are performed by computing hardware, for example, by one or more processors or computers, implemented as described above executing instructions or software to perform the operations described in this application that are performed by the methods.
Instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the one or more processors or computers to operate as a machine or special-purpose computer to perform the operations that are performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the one or more processors or computers, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the one or more processors or computers using an interpreter. The instructions or software may be written using any programming language based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions used herein, which disclose algorithms for performing the operations that are performed by the hardware components and the methods as described above.
The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, Blu-ray or optical disk storage, hard disk drive (HDD), solid state drive (SSD), flash memory, a card type memory such as multimedia card micro or a card (for example, secure digital (SD) or extreme digital (XD)), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to one or more processors or computers so that the one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the one or more processors or computers.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2019-0150527 | Nov 2019 | KR | national |
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 62/903,983 filed on Sep. 23, 2019, in the U.S. Patent and Trademark Office, and claims the benefit under 35 U.S.C. § 119(a) of Korean Patent Application No. 10-2019-0150527 filed on Nov. 21, 2019, in the Korean Intellectual Property Office, the entire disclosures of which are incorporated herein by reference for all purposes.
Number | Date | Country | |
---|---|---|---|
62903983 | Sep 2019 | US |