This disclosure relates to the field of artificial intelligence, and specifically, to a feature extraction method and apparatus.
Computer vision is an integral part of various intelligent/autonomous systems in various application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military affairs. Computer vision is the study of how to use a camera/video camera and a computer to obtain the required data and information about a photographed object. Figuratively, an eye (a camera/video camera) and a brain (an algorithm) are mounted on a computer to replace human eyes to recognize, track, and measure a target, so that the computer can perceive an environment. Perceiving may be considered as extracting information from a sensory signal. Therefore, computer vision may also be considered as a science of studying how to make an artificial system perceive an image or multidimensional data. Generally, computer vision replaces a visual organ with various imaging systems to obtain input information, and then replaces a brain with a computer to process and interpret the input information. The ultimate study objective of computer vision is to enable a computer to observe and understand the world through vision as human beings do, and to automatically adapt to an environment.
With the development of computer vision, more tasks, including image classification, 2D detection, semantic segmentation, key point detection, linear object detection (for example, lane line or stop line detection in a self-driving technology), drivable area detection, scene recognition, and the like, can be executed by using a visual perception model. A problem of great concern is how to enable the visual perception model to better complete a target task, that is, how to improve the performance and effect of the visual perception model.
This disclosure provides a feature extraction method and apparatus, so that an extracted feature of a to-be-processed object can better characterize the to-be-processed object, thereby improving performance of a model to which the feature extraction method is applied.
To resolve the foregoing technical problem, the following technical solutions are provided in embodiments of this disclosure:
According to a first aspect, this disclosure provides a feature extraction method. The method may include: Feature extraction is performed on a first vector by using a first feature extraction model, to obtain a first feature. The first vector indicates a first segmented object, and the first segmented object may include some elements in a to-be-processed object. A data type of the to-be-processed object may be image data, text data, or voice data. It may be understood that the segmented to-be-processed object includes some elements in the to-be-processed object. When the to-be-processed object is an image, some elements in the to-be-processed object are some pixels in the image; or when the to-be-processed object is a text or a voice, some elements in the to-be-processed object are characters or words in the text or the voice. Feature extraction is performed on a second vector by using a second feature extraction model, to obtain a plurality of second features, where the second vector indicates some elements in the first segmented object.
At least two second features are fused based on a first target weight, to obtain a first fused feature, where the first target weight is determined based on a first parameter value, and the first target weight is positively correlated with the first parameter value. The first parameter value indicates a similarity between each of the at least two second features and a target second feature, and the target second feature is any one of the at least two second features. Alternatively, the first target weight is a second parameter value, and the second parameter value includes at least one preset constant. A similarity between one or more second features and the target second feature may be measured in different manners. For example, the similarity may be measured by using a value of an inner product between two second features. A larger inner product between the two second features indicates a higher similarity between the two second features and a greater weight; that is, the two features have greater impact on each other. For example, it is assumed that the second features include a feature A, a feature B, and a feature C. When the target second feature is the feature B, it is assumed that an inner product between the feature A and the feature B is greater than an inner product between the feature C and the feature B, which represents a higher similarity between the feature A and the feature B. In this case, the feature A has greater impact on the feature B, and the feature C has smaller impact on the feature B. The weights may be respectively set to 0.9 and 0.1. In this case, the fusing of the at least two second features based on the first target weight may be understood as 0.9*A+B+0.1*C, and this result represents one first fused feature. It should be noted that using an inner product to represent a similarity between two features is only one manner of measuring the similarity between the two features. The similarity between the two features may also be measured in another manner. For example, a neural network model may be trained, and the trained neural network is used to obtain the similarity between the two features. Alternatively, the first target weight may be preset. For example, each of the at least two second features may be set to have the same impact on the target second feature. In this case, the target second feature and one or more other second features may be averaged, and the average value is added to the target second feature. It should be noted that the foregoing does not list all manners of measuring the impact of one or more second features on the target second feature. In addition to the foregoing several measurement manners, other manners may be used for measurement.
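As a minimal sketch of the inner-product-based fusion described above (the 0.9*A + B + 0.1*C example), the following Python fragment derives weights from inner products and fuses the features. The softmax normalization, array shapes, and names are illustrative assumptions rather than the exact formulation of this disclosure.

```python
# Minimal sketch, assuming a softmax normalization of inner products; the
# function and variable names are illustrative, not part of this disclosure.
import numpy as np

def fuse_second_features(features, target_index):
    """features: (k, d) array of k second features; target_index: index of the target second feature."""
    target = features[target_index]                    # target second feature
    sims = features @ target                           # inner products = similarities
    weights = np.exp(sims) / np.exp(sims).sum()        # larger similarity -> larger weight
    return (weights[:, None] * features).sum(axis=0)   # weighted fusion (first fused feature)

# Example with three second features A, B, C, where B is the target second feature.
feats = np.array([[1.0, 0.5], [0.9, 0.6], [0.1, -0.2]])   # rows: A, B, C
fused_for_b = fuse_second_features(feats, target_index=1)
```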
Fusion processing is performed on the first feature and the first fused feature to obtain a second fused feature, where the second fused feature is used to obtain a feature of the to-be-processed object. The second fused feature is used to determine a final feature of the to-be-processed object. In an embodiment, a second fused feature output by a last feature extraction module of a plurality of feature extraction modules in the first feature extraction model is used to determine a final extracted feature of the to-be-processed object. For each segmented object, the last feature extraction module outputs a corresponding second fused feature, and a set of second fused features is the final feature of the to-be-processed object. In an embodiment, weighting processing is performed on a second fused feature that corresponds to each segmented object and that is output by the last feature extraction module, and a result of the weighting processing is used as the final feature of the to-be-processed object.
It can be learned from the solution provided in the first aspect that an association relationship between elements is established by using the second feature extraction model, where the association relationship is implicitly included in the first fused feature. The first fused feature and the first feature are fused, so that the extracted feature includes the association relationship between elements and can better characterize the to-be-processed object. If the extracted feature of the to-be-processed object can represent more information of the to-be-processed object, it is more helpful for the model to analyze the to-be-processed object.
In an embodiment of the first aspect, the method may further include: obtaining a third feature, where the third feature is obtained by performing feature extraction on a third vector by using the first feature extraction model. The third vector indicates a second segmented object, and the second segmented object may include some elements in the to-be-processed object. That fusion processing is performed on the first feature and the first fused feature to obtain a second fused feature may include: fusing the first feature and the third feature based on a second target weight, to obtain a third fused feature, where the second target weight is determined based on a third parameter value, and the third parameter value indicates a similarity between the third feature and the first feature, or the second target weight is a fourth parameter value, and the fourth parameter value includes at least one preset constant; and performing fusion processing on the third fused feature and the first fused feature to obtain the second fused feature. In an embodiment, an association relationship between segmented objects is established by using the first feature extraction model. In a process of extracting the feature of the to-be-processed object, the association relationship between segmented objects is retained, and the association relationship between elements is retained, so that the extracted feature can better represent the to-be-processed object. Further, performance of a model to which the feature extraction method is applied can be improved.
In an embodiment of the first aspect, the first vector indicates the first segmented object carrying first position information, and the first position information is position information of the first segmented object in the to-be-processed object. An example in which the to-be-processed object is an image is used for description. The first position information may be represented by using coordinate information of one pixel, or may be represented by using coordinate information of a plurality of pixels. For example, when the to-be-processed image is evenly segmented to obtain a plurality of image blocks, position information of each image block may be represented by coordinates of a pixel in an upper left corner of each image block. For another example, when each image block is a regular rectangle or square, the position information of each image block may be represented by coordinates of a pixel in an upper left corner and coordinates of a pixel in a lower right corner of each image block. Alternatively, the first position information may be represented by using a coding vector. In an embodiment, the first vector includes more information, that is, the first position information, so that the first feature extraction model can obtain more information. More information obtained by the first feature extraction model can be more helpful for the first feature extraction model to learn, so as to better extract an image feature.
In an embodiment of the first aspect, each second vector indicates some elements in the first segmented object carrying second position information, and the second position information is position information, in the first segmented object, of some elements in the first segmented object. In an embodiment, the second vector includes more information, that is, the second position information. More information obtained by the second feature extraction model can be more helpful for the second feature extraction model to learn, so as to better extract an image feature.
In an embodiment of the first aspect, that fusion processing is performed on the first feature and the first fused feature to obtain a second fused feature may include: performing end-to-end concatenation processing on the first feature and the first fused feature to obtain the second fused feature. In an embodiment, a manner of performing fusion processing on the first feature and the first fused feature is provided, thereby increasing diversity of the solution.
In an embodiment of the first aspect, that fusion processing is performed on the first feature and the first fused feature to obtain a second fused feature may include: performing a target operation on the first feature and the first fused feature to obtain the second fused feature, where the target operation may include at least one of addition or multiplication. In an embodiment, a manner of performing fusion processing on the first feature and the first fused feature is provided, thereby increasing diversity of the solution.
In an embodiment of the first aspect, that a target operation is performed on the first feature and the first fused feature to obtain the second fused feature may include: when there are a plurality of first fused features, performing end-to-end concatenation processing on the plurality of first fused features to obtain a concatenated feature; mapping the concatenated feature to a feature of a target length, where the target length is determined based on a length of the first feature; and performing addition processing on the first feature and the feature of the target length to obtain the second fused feature. In an embodiment, a manner of performing fusion processing on the first feature and the first fused feature is provided, thereby increasing diversity of the solution.
In an embodiment of the first aspect, that at least two second features are fused based on a first target weight, to obtain a first fused feature may include: inputting the at least two second features into a target model, where an output of the target model is the first fused feature, the target model may include one of a self-attention network transformer, a convolutional neural network (CNN), or a recurrent neural network (RNN), and when the target model is the transformer, the first target weight is determined based on an inner product between each of the at least two second features and the target second feature, or when the target model is the CNN or the RNN, the first target weight is preset. In an embodiment, several manners of obtaining the first fused feature are provided, thereby increasing diversity of the solution.
In an embodiment of the first aspect, the to-be-processed object is a to-be-processed image, the first vector indicates a first segmented image, the first segmented image may include some pixels in the to-be-processed image, the second vector indicates some pixels in the first segmented image, and the second fused feature is used to obtain a feature of the to-be-processed image. In an embodiment, the to-be-processed object is a to-be-processed image. In a process of extracting an image feature, an association relationship between image blocks is retained, and an association relationship between pixels (or pixel blocks) is retained. Therefore, a color feature, a texture feature, a shape feature, a spatial relationship feature, and the like of the image can be well captured based on the extracted image feature, and further, performance of a visual perception model can be improved.
According to a second aspect, this disclosure provides a feature extraction model. The feature extraction model may include a first feature extraction model and a second feature extraction model. The first feature extraction model is configured to obtain a first feature, where the first feature is obtained by performing feature extraction on a first vector by using the first feature extraction model, the first vector indicates a first segmented object, and the first segmented object may include some elements in a to-be-processed object. The second feature extraction model is configured to obtain a plurality of second features, where the second feature is obtained by performing feature extraction on a second vector by using the second feature extraction model, and the second vector indicates some elements in the first segmented object. The second feature extraction model is further configured to fuse at least two second features based on a first target weight, to obtain a first fused feature, where the first target weight is determined based on a first parameter value, and the first target weight is positively correlated with the first parameter value. The first parameter value indicates a similarity between each of the at least two second features and a target second feature, and the target second feature is any one of the at least two second features. Alternatively, the first target weight is a second parameter value, and the second parameter value includes at least one preset constant. The first feature extraction model is further configured to perform fusion processing on the first feature and the first fused feature to obtain a second fused feature, where the second fused feature is used to obtain a feature of the to-be-processed object.
In an embodiment of the second aspect, the first feature extraction model is further configured to obtain a third feature, where the third feature is obtained by performing feature extraction on a third vector by using the first feature extraction model, the third vector indicates a second segmented object, and the second segmented object may include some elements in the to-be-processed object; and the first feature extraction model is configured to: fuse the first feature and the third feature based on a second target weight, to obtain a third fused feature, where the second target weight is determined based on a third parameter value, and the third parameter value indicates a similarity between the third feature and the first feature, or the second target weight is a fourth parameter value, and the fourth parameter value includes at least one preset constant; and perform fusion processing on the third fused feature and the first fused feature to obtain the second fused feature.
In an embodiment of the second aspect, the first vector indicates the first segmented object carrying first position information, and the first position information is position information of the first segmented object in the to-be-processed object.
In an embodiment of the second aspect, each second vector indicates some elements in the first segmented object carrying second position information, and the second position information is position information, in the first segmented object, of some elements in the first segmented object.
In an embodiment of the second aspect, the first feature extraction model is configured to perform end-to-end concatenation processing on the first feature and the first fused feature to obtain the second fused feature.
In an embodiment of the second aspect, the first feature extraction model is configured to perform a target operation on the first feature and the first fused feature to obtain the second fused feature, where the target operation may include at least one of addition or multiplication.
In an embodiment of the second aspect, the first feature extraction model is configured to: when there are a plurality of first fused features, perform end-to-end concatenation processing on the plurality of first fused features to obtain a concatenated feature; map the concatenated feature to a feature of a target length, where the target length is determined based on a length of the first feature; and perform addition processing on the first feature and the feature of the target length to obtain the second fused feature.
In an embodiment of the second aspect, the second feature extraction model is configured to input the at least two second features into a target model, where an output of the target model is the first fused feature, the target model may include one of a self-attention network transformer, a convolutional neural network CNN, or a recurrent neural network RNN, and when the target model is the transformer, the first target weight is determined based on an inner product between each of the at least two second features and the target second feature, or when the target model is the CNN or the RNN, the first target weight is preset.
In an embodiment of the second aspect, the to-be-processed object is a to-be-processed image, the first vector indicates a first segmented image, the first segmented image may include some pixels in the to-be-processed image, the second vector indicates some pixels in the first segmented image, and the second fused feature is used to obtain a feature of the to-be-processed image.
For implementations of the second aspect and the possible implementations of the second aspect, and for the beneficial effects brought by the possible implementations, refer to the descriptions in the first aspect and the possible implementations of the first aspect. Details are not described herein again.
According to a third aspect, this disclosure provides an image processing method. The method may include: obtaining a to-be-processed image; inputting the to-be-processed image into a visual perception model to extract an image feature by using a feature extraction model that may be included in the visual perception model, where the feature extraction model is the feature extraction model described in any one of the second aspect or the possible implementations of the second aspect; and performing visual perception on the to-be-processed image based on the image feature.
In an embodiment of the third aspect, the performing visual perception on the to-be-processed image based on the image feature may include: classifying the to-be-processed image based on the image feature, to obtain a classification result of the to-be-processed image.
In an embodiment of the third aspect, the obtaining a to-be-processed image may include: obtaining the to-be-processed image by using a sensor of a vehicle; and the performing visual perception on the to-be-processed image based on the image feature may include: performing semantic segmentation on the to-be-processed image based on the image feature, to obtain a region in which a target object in the to-be-processed image is located, where the target object may include one or more of a person, a vehicle, and a road surface.
In an embodiment of the third aspect, the obtaining a to-be-processed image may include: obtaining the to-be-processed image by using a monitoring device; and the performing visual perception on the to-be-processed image based on the image feature may include: if it is recognized, based on the image feature, that the to-be-processed image includes a person, recognizing an attribute of the person based on the image feature, where the attribute may include one or more of a gender, a complexion, an age, and clothes.
According to a fourth aspect, this disclosure provides an electronic device. The electronic device may include a processor. The processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in any one of the first aspect or the possible implementations of the first aspect is implemented.
According to a fifth aspect, this disclosure provides a computer-readable storage medium. The computer-readable storage medium may include a program. When the program is run on a computer, the computer is enabled to perform the method described in any one of the first aspect and the possible implementations of the first aspect.
According to a sixth aspect, this disclosure provides a circuit system. The circuit system may include a processing circuit, and the processing circuit is configured to perform the method described in any one of the first aspect or the possible implementations of the first aspect.
According to a seventh aspect, this disclosure provides a computer program product. The computer program product includes instructions. When the instructions are loaded and executed by an electronic device, the electronic device is enabled to perform the method described in any one of the first aspect or the possible implementations of the first aspect.
According to an eighth aspect, this disclosure provides a chip. The chip is coupled to a memory, and is configured to execute a program stored in the memory, to perform the method described in any one of the first aspect or the possible implementations of the first aspect.
For implementations of the fourth aspect to the eighth aspect and the possible implementations thereof, and for the beneficial effects brought by the possible implementations, refer to the descriptions in the first aspect and the possible implementations of the first aspect. Details are not described herein again.
Embodiments of this disclosure provide a feature extraction method and apparatus. The solutions provided in this disclosure can improve performance and an effect of a visual perception model.
The following describes embodiments of this disclosure with reference to the accompanying drawings. A person of ordinary skill in the art may learn that, with the development of technologies and the emergence of new scenarios, the technical solutions provided in embodiments of this disclosure are also applicable to a similar technical problem.
To better understand the solutions provided in this disclosure, an overall working procedure of an artificial intelligence system is first described.
(1) Infrastructure
The infrastructure provides computing capability support for the artificial intelligence system, implements communication with the external world, and provides support by using a basic platform. The infrastructure communicates with the outside by using a sensor. A computing capability is provided by an intelligent chip, for example, a hardware acceleration chip such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA). The basic platform includes related platforms, for example, a distributed computing framework and a network, for assurance and support, including cloud storage and computing, an interconnection network, and the like. For example, the sensor communicates with the outside to obtain data, and the data is provided to an intelligent chip in a distributed computing system provided by the basic platform for computing.
(2) Data
Data at an upper layer of the infrastructure indicates a data source in the field of artificial intelligence. The data relates to a graph, an image, a speech, and a text, further relates to internet of things data of a conventional device, and includes service data of an existing system and perception data such as force, displacement, a liquid level, a temperature, and humidity.
(3) Data Processing
Data processing usually includes data training, machine learning, deep learning, searching, inference, decision making, and the like.
Machine learning and deep learning may mean performing symbolic and formal intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is a process in which human intelligent inference is simulated in a computer or an intelligent system, and machine thinking and problem resolving are performed by using formal information according to an inference control policy. A typical function is searching and matching.
Decision making is a process of making a decision after intelligent information is inferred, and usually provides functions such as classification, ranking, and prediction.
(4) General Capability
After the foregoing data processing is performed on data, some general capabilities may be further formed based on a data processing result, for example, an algorithm or a general system, such as image classification, personalized image management, personalized battery charging management, text analysis, computer vision processing, or speech recognition.
(5) Intelligent Product and Industry Application
The intelligent product and the industry application are a product and an application of the artificial intelligence system in various fields, and are the packaging of an overall artificial intelligence solution, whereby decision making for intelligent information is productized and the application is implemented. Application fields mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, intelligent home, intelligent health care, intelligent security protection, self-driving, a smart city, and the like.
Embodiments of this disclosure may be applied to a plurality of application scenarios in the foregoing fields, for example, may be applied to an application scenario of natural language search, to improve accuracy of natural language search; or may be applied to an application scenario of machine translation, to make a translation result more accurate; or may be applied to an application scenario of a multi-round dialog, to improve efficiency of man-machine communication. Embodiments of this disclosure are mainly applied to application scenarios related to the computer vision field in the foregoing fields. For example, embodiments of this disclosure may be applied to application scenarios such as facial recognition, image classification, target detection, semantic segmentation, key point detection, linear object detection (for example, lane line or stop line detection in a self-driving technology), drivable area detection, and scene recognition. In an example, embodiments of this disclosure may be applied to an application scenario of self-driving. A self-driving vehicle obtains an image of an environment around the vehicle by using a camera. The image obtained by the camera is segmented, and areas in which different objects such as a road surface, a roadbed, a vehicle, and a pedestrian are located are obtained through image segmentation, so that the vehicle keeps driving in a correct area. In the self-driving field, accuracy of image segmentation is critical to safety of vehicle driving. According to the solutions provided in this disclosure, accuracy of image segmentation in the self-driving field can be improved. In another example, embodiments of this disclosure may be applied to the field of intelligent monitoring. In the field of intelligent monitoring, a key task is to perform pedestrian attribute recognition based on an image obtained by a monitoring device. A pedestrian attribute recognition task needs to be performed to recognize common attributes of a pedestrian, for example, a gender, an age, hair, clothes, and wearing. This requires that an image feature can represent more image information, for example, carry more detailed information of an image. The image feature may be obtained by inputting an image obtained by a monitoring device into a feature extraction model, and the image feature of the image is extracted by using the feature extraction model. It should be noted that, in this disclosure, the feature extraction model is sometimes referred to as a feature extraction module, and the feature extraction model and the feature extraction module have a same meaning. For example, in an example of the field of intelligent monitoring, an image obtained by a monitoring device is input into a target model, where the target model is used to perform a pedestrian attribute recognition task, the target model includes a feature extraction module, and an image feature is extracted by using the feature extraction module, so that the target model recognizes a pedestrian attribute based on the extracted image feature. According to the solutions provided in embodiments of this disclosure, performance of the feature extraction model can be improved, so that the extracted image feature can better represent image information. More image information represented by the image feature can be more helpful for improving accuracy of a visual analysis task. For the pedestrian attribute recognition task, this is more helpful for improving accuracy of pedestrian attribute recognition.
It should be understood that application scenarios of embodiments of this disclosure are not exhaustively listed herein. In the foregoing scenarios, the feature extraction method provided in embodiments of this disclosure may be used, to improve performance of the feature extraction model.
For better understanding of this solution, a system provided in an embodiment of this disclosure is first described with reference to
In a training phase, the database 230 stores a training data set. The database 230 may be represented as a storage medium in any form, and is not limited to a database in a conventional sense. The training data set may include a plurality of training samples. A data type of the training sample is not limited in this disclosure. For example, the training sample may be image data, voice data, or text data. It should be noted that data types of the training samples included in the training data set are usually the same. The training device 220 generates a first machine learning model/rule 201, and performs iterative training on the first machine learning model/rule 201 by using the training data set in the database, to obtain a mature first machine learning model/rule 201. When the training sample is image data, the first machine learning model/rule 201 is also referred to as a visual perception model in this disclosure. An example in which the training sample is image data is used to describe how to perform iterative training on the first machine learning model/rule 201 to obtain a mature first machine learning model/rule 201. When the image data is used as an input of the first machine learning model/rule 201, the first machine learning model/rule 201 extracts an image feature of the image data based on the feature extraction model, and iterative training is performed on the first machine learning model/rule 201 by using the extracted image feature. The training device may train the first machine learning model/rule 201 by using training data. Work at each layer of the first machine learning model/rule 201 may be described by using the mathematical expression y = a(W·x + b), where x is an input vector and y is an output vector. From a physical perspective, work at each layer of a deep neural network may be understood as completing transformation from input space to output space (that is, from row space to column space of a matrix) by performing five operations on the input space (a set of input vectors). The five operations include: 1. dimension increasing or dimension reduction; 2. scaling up/down; 3. rotation; 4. translation; and 5. "bending". The operation 1, the operation 2, and the operation 3 are performed by W·x, the operation 4 is performed by +b, and the operation 5 is performed by a( ). The word "space" is used herein for expression because a classified object is not a single thing, but a type of thing. Space is a set of all individuals of this type of thing. W is a weight vector, and each value in the vector indicates a weight value of one neuron at this layer of the neural network. The vector W determines the space transformation from the input space to the output space described above. In other words, a weight W at each layer controls how to transform space. A purpose of training the first machine learning model/rule 201 is to finally obtain a weight matrix (a weight matrix formed by vectors W at a plurality of layers) at all layers of the trained first machine learning model/rule 201. Therefore, the training process of the first machine learning model/rule 201 is essentially a manner of learning control of space transformation, and more specifically, learning a weight matrix.
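As an informal illustration of the per-layer transformation y = a(W·x + b) described above, the following sketch applies a weight matrix, a bias, and an activation to an input vector; ReLU stands in for a( ), and the dimensions are arbitrary examples.

```python
# Illustrative sketch of one layer computing y = a(W·x + b); ReLU is assumed
# as the activation a( ), and the shapes are chosen only for the example.
import numpy as np

def layer(x, W, b):
    return np.maximum(W @ x + b, 0.0)   # a(W·x + b) with a = ReLU

x = np.random.randn(4)      # input vector (input space)
W = np.random.randn(3, 4)   # weight matrix: dimension change from 4 to 3
b = np.random.randn(3)      # translation term
y = layer(x, W, b)          # output vector (output space)
```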
It is expected that an output of the first machine learning model/rule 201 is as close as possible to an expected value that is really desired. An expected value that is really desired is related to a training goal of the first machine learning model/rule 201 or a task that needs to be completed by the first machine learning model/rule 201. For example, if the first machine learning model/rule 201 is used to perform an image classification task, an output of the first machine learning model/rule 201 is as close as possible to a real image classification result. It should be noted that this disclosure focuses on how to enable the feature extracted by the first machine learning model/rule 201 to better represent information about a to-be-processed object. A task performed by the first machine learning model/rule 201 based on the extracted feature is not limited in this disclosure. To make the output of the first machine learning model/rule 201 as close as possible to an expected value that is really desired, a current predicted value of the network may be compared with a target value that is really desired, and then a weight vector at each layer of the neural network is updated based on a difference between the current predicted value and the target value (certainly, there is usually an initialization process before the first update, that is, a parameter is preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to reduce the predicted value, and the adjustment is performed continuously until the neural network can predict the target value that is really desired. Therefore, “how to obtain, through comparison, a difference between the predicted value and the target value” needs to be predefined. This is a loss function or an objective function. The loss function and the objective function are important equations that measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the first machine learning model/rule 201 is a process of minimizing the loss as much as possible.
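The training logic described above (compare the predicted value with the desired target value through a loss function and adjust the weights to reduce the loss) can be sketched as follows; a squared-error loss and plain gradient descent are assumptions made only for illustration.

```python
# Hedged sketch of one training step: compute a loss that measures the
# difference between the predicted value and the target value, then adjust
# the weights to reduce that loss. Squared error and gradient descent are
# illustrative choices, not the loss prescribed by this disclosure.
import numpy as np

def train_step(W, b, x, target, lr=0.1):
    pred = W @ x + b                    # current predicted value
    error = pred - target               # difference between prediction and target
    loss = 0.5 * np.sum(error ** 2)     # larger loss means larger difference
    W -= lr * np.outer(error, x)        # update the weight matrix to reduce the loss
    b -= lr * error                     # update the bias accordingly
    return W, b, loss
```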
In an inference phase, the execution device 210 may invoke data, code, and the like in the data storage system 240, or may store data, instructions, and the like in the data storage system 240. The data storage system 240 may be configured in the execution device 210, or may be a memory outside the execution device 210. The execution device 210 may invoke the mature first machine learning model/rule 201 to extract a feature of the to-be-processed object, and perform a task based on the extracted feature of the to-be-processed object. A data type of the to-be-processed object is generally the same as a data type of the training sample. A task is determined based on a training task in the training phase. For example, in the training phase, iterative training is performed on the first machine learning model/rule 201 by using the training data set in the database, so that the mature first machine learning model/rule 201 can extract a feature from an image, and perform an image classification task based on the extracted feature. In the inference phase, the execution device 210 may invoke the mature first machine learning model/rule 201 to extract a feature of an image, and perform an image classification task based on the extracted image feature.
In some embodiments of this disclosure, for example, in
In some other embodiments of this disclosure, the execution device 210 and the client device may be independent devices. The execution device 210 is configured with an input/output interface, to exchange data with the client device. The “user” may input at least one task to the execution device 210 by using the input/output interface of the client device, and the execution device 210 returns a processing result to the client device by using the input/output interface.
A process of extracting a feature of a to-be-processed object is involved both in a process of performing iterative training on the first machine learning model/rule 201 and in a process of performing a task by using the mature first machine learning model/rule 201. Therefore, the solutions provided in this disclosure may be performed by the training device 220 or the execution device 210.
Currently, some first machine learning models/rules 201 require an input of a one-dimensional vector. For example, a self-attention network (for example, a transformer), a long short-term memory (LSTM) neural network, and a gated recurrent unit (GRU) network require an input of a one-dimensional vector. However, the to-be-processed object is usually a multidimensional tensor. For example, an image is usually a three-dimensional tensor. Therefore, the to-be-processed object needs to be preprocessed, and the tensor needs to be converted into a vector before being used as an input of these models. The applicant finds that some solutions for preprocessing the to-be-processed object damage an internal structure of the to-be-processed object, and consequently, the extracted feature of the to-be-processed object loses detailed information, which is disadvantageous for correct prediction of these models. The following describes disadvantages of some solutions by using an example in which the to-be-processed object is an image and the first machine learning model/rule 201 is a transformer.
Refer to
To resolve the foregoing problem, an embodiment of this disclosure provides a feature extraction method, so that a first machine learning model/rule 201 includes at least two self-attention modules, where one self-attention module is configured to establish an association relationship between image blocks, and the other self-attention module is configured to establish an association relationship between pixels, thereby improving model performance.
As shown in
401. Perform segmentation processing on a to-be-processed object to obtain a segmented to-be-processed object.
A data type of the to-be-processed object may be image data (image for short hereinafter), text data (text for short hereinafter), or voice data (voice for short hereinafter). It may be understood that the segmented to-be-processed object includes some elements in the to-be-processed object. When the to-be-processed object is an image, some elements in the to-be-processed object are some pixels in the image. When the to-be-processed object is a text or a voice, some elements in the to-be-processed object are characters or words in the text or the voice. In an embodiment, the to-be-processed object in this disclosure is image data. In the following embodiment, the feature extraction method provided in this disclosure is described by using an example in which the to-be-processed object is image data. For ease of description, a segmented image is hereinafter referred to as an image block, each image block includes some pixels in the image, and all image blocks form a complete image.
In an embodiment, the image may be evenly segmented, so that each segmented image block includes a same quantity of pixels. In an embodiment, the image may not be evenly segmented, so that quantities of pixels included in all segmented image blocks are not completely the same. In an embodiment, some image blocks include a same quantity of pixels, some image blocks include different quantities of pixels, or all image blocks include different quantities of pixels. In addition, all pixels included in each image block may be adjacent pixels, or some pixels may be adjacent pixels, and some pixels are not adjacent pixels. The adjacent pixels mean that pixels in a complete image are in spatially adjacent positions. In a preferred implementation, the image may be evenly segmented, and all pixels included in each image block are adjacent pixels. An image is evenly segmented into n image blocks, which may be understood with reference to a formula 1-1.
X = [X_1, X_2, . . . , X_n] ∈ R^(n×p×p×3)   (1-1)
X represents the to-be-processed image. Each of X_1 to X_n represents a segmented image block, and n is a positive integer greater than 1 and represents a quantity of segmented image blocks. R^(n×p×p×3) represents a tensor whose size is n×p×p×3, where a size of each image block is p×p×3, p×p represents the two spatial dimensions of the image block, and 3 represents another dimension, that is, the channel dimension. For example, if a pixel value included in each image block is a red, green, and blue (RGB) color value, the channel dimension of the image block is 3, and each pixel value may be a long integer representing a color.
The to-be-processed image is segmented into a plurality of image blocks, which helps accelerate the image feature extraction process of a model. In other words, the model may process the plurality of image blocks in parallel, and extract image features of the plurality of image blocks simultaneously.
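For example, the even segmentation of formula (1-1) can be sketched as follows; the image size, the block size p, and the assumption that the image dimensions are divisible by p are illustrative.

```python
# Minimal sketch of even segmentation per formula (1-1): an H x W x 3 image is
# split into n non-overlapping p x p x 3 image blocks. H and W are assumed to
# be divisible by p for simplicity.
import numpy as np

def segment_image(image, p):
    H, W, C = image.shape
    return (image
            .reshape(H // p, p, W // p, p, C)   # group rows and columns into p-sized tiles
            .transpose(0, 2, 1, 3, 4)           # bring the two tile indices together
            .reshape(-1, p, p, C))              # shape (n, p, p, 3), n = (H/p) * (W/p)

img = np.random.rand(224, 224, 3)
blocks = segment_image(img, p=16)               # shape (196, 16, 16, 3), that is, n = 196
```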
402. Obtain a plurality of element sets for each segmented to-be-processed object.
Each element set includes some elements in the segmented to-be-processed object. For example, for each image block, an element set is obtained, and each element set includes some pixels in each image block. For ease of description, some pixels in the image block are hereinafter referred to as a pixel block.
A plurality of pixel blocks may be obtained for each image block, and quantities of pixels included in any two of the plurality of pixel blocks may be the same or different. In addition, pixels included in each pixel block may be adjacent pixels or non-adjacent pixels. For example, reference may be made to a formula 1-2 for understanding.
Y_0^i = [y_0^(i,1), y_0^(i,2), . . . , y_0^(i,m)]   (1-2), where

i = 1, 2, . . . , n, and n is a positive integer greater than 1 that represents the quantity of segmented image blocks; y_0^(i,j) ∈ R^c, j = 1, 2, . . . , m, where m is a positive integer greater than 1 that represents the quantity of pixel blocks included in one image block; and c represents the length of the vector corresponding to a pixel block.
In this case, there are n pixel block groups for n image blocks, which may be understood with reference to a formula 1-3.
y_0 = [Y_0^1, Y_0^2, . . . , Y_0^n]   (1-3)
It should be noted that pixels included in any two of the plurality of pixel blocks may overlap.
For example, the following provides a manner of obtaining an element set. The element set may be obtained in an image to column (im2col) manner. The im2col mainly converts data in each window of the image data into column vectors, and then arranges the column vectors by column to form a new matrix. The following provides descriptions with reference to
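As a rough illustration of the im2col conversion just described, the following sketch flattens each window of an image block into a column vector and arranges the columns into a new matrix; the window size and stride are illustrative assumptions.

```python
# Rough sketch of im2col: each k x k window is flattened into a column vector,
# and the column vectors are arranged by column to form a new matrix. The
# window size k and the stride are illustrative assumptions.
import numpy as np

def im2col(image, k, stride):
    H, W, C = image.shape
    cols = []
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            window = image[i:i + k, j:j + k, :]   # one k x k window of the image block
            cols.append(window.reshape(-1))       # flatten the window into a column vector
    return np.stack(cols, axis=1)                 # arrange the columns into a matrix

block = np.random.rand(16, 16, 3)
columns = im2col(block, k=4, stride=4)            # shape (4 * 4 * 3, 16) = (48, 16)
```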
403. Perform feature extraction on a first vector by using a first feature extraction model, to obtain a first feature, and perform feature extraction on a second vector by using a second feature extraction model, to obtain a second feature.
The first vector indicates a segmented to-be-processed object. For example, the first vector indicates the image block mentioned in operation 401 and operation 402. The second vector indicates some elements in a segmented object. For example, the second vector indicates the pixel block mentioned in operation 401 and operation 402.
The first feature extraction model and the second feature extraction model may be understood as a plurality of feature extraction modules in the first machine learning model/rule 201 mentioned above. For example, the first feature extraction model and the second feature extraction model may be a CNN or an RNN. The first feature extraction model includes a plurality of feature extraction modules, and the second feature extraction model includes a plurality of feature extraction modules. For one of the first feature extraction model and the second feature extraction model, the plurality of feature extraction modules are connected end-to-end, and an output of a feature extraction module is used as an input of a next feature extraction module, so that the next feature extraction module continues to perform feature extraction. Each feature extraction module has a weight matrix, and a function of the feature extraction module in image processing is equivalent to a filter for extracting information from an input image matrix. The weight matrix is used to traverse an input, to complete a task of extracting a feature from an image. For a current feature extraction module of the first feature extraction model, an output of a feature extraction module may be considered as a first feature. The image feature mainly includes a color feature, a texture feature, a shape feature, a spatial relationship feature, and the like of the image. The color feature is a global feature, and describes a surface property of a scene corresponding to an image or an image region. The color feature is usually a pixel-based feature. In this case, all pixels belonging to the image or the image region have respective contributions. Because the color is insensitive to changes of a direction, a size, and the like of the image or the image region, the color feature cannot well capture a local feature of an object in the image. The texture feature is also a global feature, and also describes a surface property of a scene corresponding to an image or an image region. However, a texture is only a feature of a surface of an object, and cannot completely reflect an essential attribute of the object. Therefore, content of a high-layer image cannot be obtained by only using the texture feature. Different from the color feature, the texture feature is not a pixel-based feature, but needs to be statistically calculated in a region including a plurality of pixels. The shape feature is classified into two types in a representation method: One is a contour feature, and the other is a region feature. The contour feature of the image is mainly for an outer boundary of an object. The region feature of the image is related to an entire shape region. The spatial relationship feature refers to a spatial position or relative direction relationship between a plurality of objects obtained through image segmentation. These relationships may also be classified into a connection/adjacency relationship, an overlapping relationship, an inclusion relationship, and the like. Spatial position information may be usually classified into relative spatial position information and absolute spatial position information. The relative spatial position information emphasizes a relative status between targets, for example, an up and down relationship or a left and right relationship. The absolute spatial position information emphasizes a distance between targets and orientations of the targets. 
It should be noted that the image features listed above may be used as some examples of features in the image, and the image may further have other features, for example, a higher-level feature: a semantic feature. Details are not described herein again.
404. Fuse at least two second features based on a first target weight, to obtain a first fused feature.
For one feature extraction module in the second feature extraction model, the feature extraction module fuses the at least two second features based on the first target weight, to obtain the first fused feature. A purpose of obtaining the first fused feature is to establish an association relationship between pixel blocks. Establishing an association relationship between pixel blocks may be understood as taking impact of one or more other pixel blocks on a pixel block into account when extracting an image feature of the pixel block. If the impact of the one or more other pixel blocks on the pixel block is greater, the weight is greater. If the impact of the one or more other pixel blocks on the pixel block is smaller, the weight is smaller. The impact of the one or more pixel blocks on the pixel block may be measured in different manners. For example, the impact may be measured by using a similarity between vectors corresponding to two pixel blocks. In an embodiment, the impact may be measured by using a value of an inner product between the vectors corresponding to the two pixel blocks. A larger inner product between the vectors corresponding to the two pixel blocks indicates a higher similarity between the two pixel blocks, and a greater weight. For another example, the neural network model may be further trained, and a similarity between pixel blocks may be obtained by using the neural network model. Alternatively, a preset operation may be performed on the vectors corresponding to the two pixel blocks, and impact of another pixel block on the pixel block is obtained based on a result of the preset operation. For example, an average value may be calculated for a vector corresponding to the to-be-processed pixel block and a vector corresponding to a pixel block adjacent to the to-be-processed pixel block, and the average value is superimposed on the vector corresponding to the to-be-processed pixel block.
In a preferred implementation, the second feature extraction model may be a neural network having a self-attention mechanism. For example, the second feature extraction model may be a transformer. When the second feature extraction model is the transformer, the first fused feature may be obtained after feature extraction is performed on the second vector by using the second feature extraction model. Assuming that the second feature extraction model is the transformer and that the to-be-processed object is image data, the following describes feature extraction performed on the second vector by using the second feature extraction model, to obtain the second feature or the first fused feature. Refer to
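As a hedged sketch of how a transformer-style second feature extraction model fuses pixel-block vectors, the following fragment computes attention weights from scaled inner products, so that more similar pixel blocks receive larger weights; the projection matrices are random placeholders rather than trained parameters of this disclosure.

```python
# Minimal self-attention sketch over the m pixel-block vectors of one image
# block: weights come from inner products between queries and keys, so more
# similar pixel blocks get larger weights. Wq, Wk, Wv are random placeholders
# standing in for trained projection matrices.
import numpy as np

def self_attention(Y, Wq, Wk, Wv):
    """Y: (m, c) matrix whose rows are the second vectors of m pixel blocks."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # pairwise scaled inner products
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the pixel blocks
    return weights @ V                              # each row is one first fused feature

m, c = 16, 24
Y = np.random.randn(m, c)
Wq, Wk, Wv = (np.random.randn(c, c) for _ in range(3))
fused = self_attention(Y, Wq, Wk, Wv)               # (m, c)
```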
In a process of extracting an image feature by each block, an association relationship between pixels (or pixel blocks) may be established in a plurality of manners. When each block in the second feature extraction model performs feature extraction on a pixel block, self-attention calculation is performed on a plurality of pixel blocks, and impact of each pixel block on the currently processed pixel block is taken into account.
In an embodiment, the first feature extraction model and the second feature extraction model are the foregoing models that require an input of a one-dimensional vector. For example, the first feature extraction model may be one of the transformer, a GRU, and an LSTM, and the second feature extraction model may be one of the transformer, the GRU, and the LSTM. As mentioned above, these feature extraction models require an input of a one-dimensional vector. Therefore, for these models, after the image block and the pixel block are obtained in operation 401 and operation 402, the image block further needs to be converted into a vector representation, the image block represented by the vector is used as an input of the first feature extraction model, the pixel block is converted into a vector representation, and the pixel block represented by the vector is used as an input of the second feature extraction model. There may be a plurality of manners of converting an image block into a vector representation. For example, refer to
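One possible manner of converting an image block (or a pixel block) into a one-dimensional vector is to flatten it and apply a linear projection, as in the following sketch; the projection matrix and the model width are illustrative placeholders.

```python
# Sketch of converting a p x p x 3 image block into the one-dimensional
# vector required by models such as the transformer: flatten the block and
# project it to the model width. W_proj is an illustrative placeholder for a
# learned projection.
import numpy as np

def block_to_vector(block, W_proj):
    flat = block.reshape(-1)      # flatten the p x p x 3 pixels into one vector
    return flat @ W_proj          # linear projection to the model dimension

p, d_model = 16, 384
W_proj = np.random.randn(p * p * 3, d_model)
first_vector = block_to_vector(np.random.rand(p, p, 3), W_proj)   # input of the first model
```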
405. Perform fusion processing on the first feature and the first fused feature to obtain a second fused feature, where the second fused feature is used to obtain a feature of the to-be-processed object.
For a feature extraction module in the first feature extraction model, fusion processing may be performed on the first feature and the first fused feature in a plurality of manners according to the solution provided in this disclosure. The following separately provides descriptions from two aspects: a fusion occasion and a fusion manner.
The fusion occasion is first described. Refer to subfigure a in
Then the fusion manner is described. In an embodiment, when there are a plurality of first fused features, end-to-end concatenation processing may be performed on the plurality of first fused features to obtain a concatenated feature. The concatenated feature is mapped to a feature of a target length, where the target length is determined based on a length of the first feature. If a length of the concatenated feature is the same as the length of the first feature, addition processing may be directly performed on the two features, that is, addition processing is performed on the first feature and the feature of the target length to obtain the second fused feature. In an embodiment, end-to-end concatenation processing is performed on the first feature and the first fused feature to obtain the second fused feature. For example, concatenation processing is performed on the first feature and the concatenated feature to obtain the second fused feature. In an embodiment, a target operation is performed on the first feature and the first fused feature to obtain the second fused feature, where the target operation includes at least one of addition or multiplication.
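The first fusion manner above (concatenate the first fused features, map the result to the length of the first feature, and add it to the first feature) can be sketched as follows; the mapping matrix stands in for a learned linear layer, and the dimensions are illustrative.

```python
# Hedged sketch of one fusion manner: concatenate the first fused features of
# the m pixel blocks end to end, map the concatenated feature to the length of
# the first feature, and add the two to obtain the second fused feature. W_map
# is an illustrative placeholder for a learned linear mapping.
import numpy as np

def fuse(first_feature, first_fused_features, W_map):
    concat = first_fused_features.reshape(-1)   # end-to-end concatenation, length m * c
    mapped = concat @ W_map                     # map to the target length (length of the first feature)
    return first_feature + mapped               # addition gives the second fused feature

m, c, d = 16, 24, 384
first_feature = np.random.randn(d)              # first feature of one image block
first_fused = np.random.randn(m, c)             # first fused features of its pixel blocks
W_map = np.random.randn(m * c, d)
second_fused = fuse(first_feature, first_fused, W_map)
```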
The second fused feature is used to determine a final feature of the to-be-processed object. In an embodiment, the second fused feature output by the last feature extraction module of the plurality of feature extraction modules in the first feature extraction model is used to determine a final extracted feature of the to-be-processed object. For each image block, the last feature extraction module outputs a corresponding second fused feature, and a set of second fused features is a final feature of the to-be-processed object. In an embodiment, weighting processing is performed on the second fused feature that corresponds to each image block and that is output by the last feature extraction module, and a result of the weighting processing is used as the final feature of the to-be-processed object.
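For illustration only, the two ways of obtaining the final feature described above (taking the set of second fused features, or weighting them) might look like the following; the uniform weights are an assumption.

```python
# Sketch of forming the final feature of the to-be-processed image from the
# second fused features output by the last feature extraction module. The
# uniform weights are illustrative; in practice the weighting may differ.
import numpy as np

n, d = 196, 384
second_fused = np.random.randn(n, d)        # one second fused feature per image block

final_as_set = second_fused                 # option 1: the set of second fused features
weights = np.full(n, 1.0 / n)               # option 2: weighting processing (uniform here)
final_weighted = weights @ second_fused     # weighted result used as the final feature, shape (d,)
```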
As can be learned from the embodiment corresponding to
Alternatively, position information of an image block and a pixel block may be retained in the process of extracting the image feature by the model, so that the color feature, the texture feature, the shape feature, the spatial relationship feature, and the like of the image can be well captured based on the image feature extracted by the first machine learning model/rule 201. The following provides descriptions with reference to an embodiment.
As shown in
901. Perform segmentation processing on a to-be-processed object to obtain a segmented to-be-processed object.
902. Obtain a plurality of element sets for each segmented to-be-processed object.
Operation 901 and operation 902 may be understood with reference to operation 401 and operation 402 in the embodiment corresponding to
903. Fuse first position information with a first vector, and fuse second position information with a second vector.
The first position information is position information of the segmented object in the to-be-processed object. For example, the first position information is position information of a segmented image block in an image. The second position information is position information of some elements in the segmented object, in the segmented object. For example, the second position information is position information of a pixel block in the image block.
The first position information may be represented by using coordinate information of one pixel or may be represented by using coordinate information of a plurality of pixels. For example, when the to-be-processed image is evenly segmented to obtain a plurality of image blocks, position information of each image block may be represented by coordinates of a pixel in an upper left corner of each image block. For another example, when each image block is a regular rectangle or square, the position information of each image block may be represented by coordinates of a pixel in an upper left corner and coordinates of a pixel in a lower right corner of each image block. It should be noted that the coordinates of the pixel in the upper left corner and the coordinates of the pixel in the lower right corner herein are merely examples for description, and are used to indicate that the first position information may be represented by using coordinate information of one pixel or coordinate information of a plurality of pixels, and does not represent a limitation on the solution provided in this disclosure.
The second position information may be represented by using coordinate information of one pixel or may be represented by using coordinate information of a plurality of pixels. In addition, because all pixels included in a pixel block may be non-adjacent pixels, position information in the pixel block may be represented by coordinates of all the pixels included in the pixel block.
The first position information and the second position information may be not only represented by using the coordinate information of the pixel, but also represented by using a coding vector. The first position information is used as an example for description. The first machine learning model/rule 201 may include a position coding module. In an initial state, the position coding module may randomly set a vector to represent position information of each image block. In a process of performing iterative training on the first machine learning model/rule 201, a parameter of the position coding module may be updated based on a loss value, so that a vector encoded by the position coding module and used to represent position information of an image block may be closer to real position information of the image block.
Fusing the first position information with the first vector and fusing the second position information with the second vector may be understood as updating X_n in formula (1-1) and Y_0^i in formula (1-3). For understanding, refer to formula (1-4) and formula (1-5):

X_n ← X_n + E_position-patch (1-4), and

Y_0^i ← Y_0^i + E_position-pixel (1-5),

where E_position-patch represents the first position information, and E_position-pixel represents the second position information.
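A minimal sketch of formula (1-4) and formula (1-5) with a learnable position coding module is shown below; the tensor shapes, the truncated-normal initialization, and the variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

num_blocks, first_len = 196, 768          # image blocks and first-vector length (illustrative)
num_pixel_blocks, second_len = 16, 192    # pixel blocks per image block and second-vector length

# Learnable position codes: randomly initialized, then updated with the model
# parameters during iterative training, as described for the position coding module.
E_position_patch = nn.Parameter(torch.zeros(num_blocks, first_len))
E_position_pixel = nn.Parameter(torch.zeros(num_pixel_blocks, second_len))
nn.init.trunc_normal_(E_position_patch, std=0.02)
nn.init.trunc_normal_(E_position_pixel, std=0.02)

first_vectors = torch.randn(num_blocks, first_len)                        # X_n per image block
second_vectors = torch.randn(num_blocks, num_pixel_blocks, second_len)    # Y_0^i per pixel block

# Formula (1-4): X_n <- X_n + E_position-patch
first_vectors = first_vectors + E_position_patch
# Formula (1-5): Y_0^i <- Y_0^i + E_position-pixel (broadcast over image blocks)
second_vectors = second_vectors + E_position_pixel
```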
As described in the embodiment corresponding to
904. Perform, by using the first feature extraction model, feature extraction on the first vector fused with the first position information, to obtain a first feature, and perform, by using the second feature extraction model, feature extraction on the second vector fused with the second position information, to obtain a second feature.
905. Fuse at least two second features based on a first target weight, to obtain a first fused feature.
906. Perform fusion processing on the first feature and the first fused feature to obtain a second fused feature, where the second fused feature is used to obtain a feature of the to-be-processed object.
Operation 904 to operation 906 may be understood with reference to operation 403 to operation 405 in the embodiment corresponding to
In the embodiments of
In an embodiment, the to-be-processed image may be segmented for a plurality of times. For example, in one segmentation, the to-be-processed image is segmented into four image blocks, then the four image blocks are preprocessed, and the four preprocessed image blocks meeting an input requirement of the feature extraction model 1 are used as an input of the feature extraction model 1. In another segmentation, the to-be-processed image is segmented into 16 image blocks, then the 16 image blocks are preprocessed, and the 16 preprocessed image blocks meeting an input requirement of the feature extraction model 2 are used as an input of the feature extraction model 2. In another segmentation, the to-be-processed image is segmented into 64 image blocks, then the 64 image blocks are preprocessed, and the 64 preprocessed image blocks meeting an input requirement of the feature extraction model 3 are used as an input of the feature extraction model 3. It should be noted that, in an embodiment, a plurality of feature extraction models, for example, the feature extraction model 1, the feature extraction model 2, and the feature extraction model 3, can work in parallel.
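The following sketch illustrates segmenting the same to-be-processed image into 4, 16, and 64 image blocks for three separate feature extraction models; the image size and the helper function are assumptions for this example.

```python
import torch

def split_into_blocks(image, blocks_per_side):
    """Evenly split a C x H x W image into blocks_per_side ** 2 image blocks."""
    c, h, w = image.shape
    bh, bw = h // blocks_per_side, w // blocks_per_side
    blocks = image.unfold(1, bh, bh).unfold(2, bw, bw)
    return blocks.permute(1, 2, 0, 3, 4).reshape(-1, c, bh, bw)

image = torch.randn(3, 224, 224)            # illustrative to-be-processed image

# Three segmentations of the same image, one per feature extraction model.
blocks_4 = split_into_blocks(image, 2)      # 4 image blocks  -> feature extraction model 1
blocks_16 = split_into_blocks(image, 4)     # 16 image blocks -> feature extraction model 2
blocks_64 = split_into_blocks(image, 8)     # 64 image blocks -> feature extraction model 3

# Each group is then preprocessed (for example, flattened and projected) to meet
# the input requirement of its model; the three models can run in parallel.
```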
According to the solution provided in an embodiment of the disclosure, performance of the feature extraction model can be improved, so that the extracted image feature can better represent image information. The more image information the image feature represents, the more helpful the feature is for improving accuracy of a visual analysis task. The following describes the solution provided in this disclosure by using an example in which the solution is applied to several typical visual analysis tasks.
Referring to
Refer to
To more intuitively understand beneficial effects brought by this solution, the following describes beneficial effects brought by embodiments of this disclosure with reference to data. During a test, the first machine learning model/rule 201 is used to perform an image classification task.
The parameter 1 indicates a quantity of feature extraction modules included in the feature extraction model, that is, the first feature extraction model includes 12 feature extraction modules, and the second feature extraction model includes 12 feature extraction modules. The parameter 2 indicates a requirement of the second feature extraction model on an input vector length. The parameter 3 indicates a quantity of heads (multi-head self-attention) in a self-attention module in the second feature extraction model. The parameter 4 indicates a requirement of the first feature extraction model on an input vector length. The parameter 5 indicates a quantity of heads (multi-head self-attention) in the self-attention module in the first feature extraction model. The parameter 6 indicates a total quantity of parameters in the first machine learning model/rule 201 to which the feature extraction method provided in this disclosure is applied, and the unit of the quantity of parameters is millions. The parameter 7 indicates a quantity of floating point operations (FLOPs), and the unit is billions.
The test data set is an ImageNet data set. An image classification test experiment is performed on the ImageNet data set.
The foregoing describes the feature extraction method provided in embodiments of this disclosure. According to the feature extraction method provided in this disclosure, the extracted feature of the to-be-processed object can better characterize the to-be-processed object, and further, performance of a model to which the feature extraction method is applied can be improved.
It may be understood that, to implement the foregoing functions, the following further provides related devices configured to implement the foregoing solutions. The related devices include corresponding hardware structures and/or software modules for performing various functions. One of ordinary skill in the art should be easily aware that the modules and algorithm operations in the examples described with reference to the embodiments disclosed in this specification can be implemented by hardware or by a combination of hardware and computer software in this disclosure. Whether a function is performed by hardware or by computer software driving hardware depends on particular applications and design constraints of the technical solutions. One of ordinary skill in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this disclosure.
The first obtaining module 1501 is configured to obtain a first feature, and the second obtaining module 1502 is configured to obtain a plurality of second features, where the first feature is obtained by performing feature extraction on a first vector by using a first feature extraction model, the first vector indicates a first segmented object, the first segmented object includes some elements in a to-be-processed object, the second feature is obtained by performing feature extraction on a second vector by using a second feature extraction model, and the second vector indicates some elements in the first segmented object; the first fusion module 1503 is configured to fuse at least two second features based on a first target weight, to obtain a first fused feature, where the first target weight is determined based on impact of each of the at least two second features on a target second feature, and the target second feature is any one of the at least two second features; and the second fusion module 1504 is configured to perform fusion processing on the first feature and the first fused feature to obtain a second fused feature, where the second fused feature is used to obtain a feature of the to-be-processed object.
In an embodiment, the first obtaining module 1501 is further configured to obtain a third feature, where the third feature is obtained by performing feature extraction on a third vector by using the first feature extraction model, the third vector indicates a second segmented object, and the second segmented object includes some elements in the to-be-processed object; the third fusion module 1505 is configured to fuse the first feature and the third feature based on a second target weight, to obtain a third fused feature, where the second target weight is determined based on impact of the third feature on the first feature; and the second fusion module 1504 is configured to perform fusion processing on the third fused feature and the first fused feature to obtain the second fused feature.
In an embodiment, the first vector indicates the first segmented object carrying first position information, and the first position information is position information of the first segmented object in the to-be-processed object.
In an embodiment, each second vector indicates some elements in the first segmented object carrying second position information, and the second position information is position information, in the first segmented object, of some elements in the first segmented object.
In an embodiment, the second fusion module 1504 is configured to perform end-to-end concatenation processing on the first feature and the first fused feature to obtain the second fused feature.
In an embodiment, the second fusion module 1504 is configured to perform a target operation on the first feature and the first fused feature to obtain the second fused feature, where the target operation includes at least one of addition or multiplication.
In an embodiment, the second fusion module 1504 is configured to: when there are a plurality of first fused features, perform end-to-end concatenation processing on the plurality of first fused features to obtain a concatenated feature; map the concatenated feature to a feature of a target length, where the target length is determined based on a length of the first feature; and perform addition processing on the first feature and the feature of the target length to obtain the second fused feature.
In an embodiment, the first fusion module 1503 is configured to input the at least two second features into a target model, where an output of the target model is the first fused feature, the target model includes one of a self-attention network (transformer), a convolutional neural network (CNN), or a recurrent neural network (RNN), and when the target model is the transformer, the first target weight is determined based on an inner product between each of the at least two second features and the target second feature, or when the target model is the CNN or the RNN, the first target weight is preset.
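When the target model is the transformer, the first target weight can be derived from pairwise inner products, for example as in the following sketch; the scaling and the softmax normalization are common transformer conventions assumed here rather than requirements of this disclosure.

```python
import torch

def fuse_second_features(second_features):
    """second_features: an N x d tensor, one row per second feature.

    Each pairwise weight is positively correlated with the inner product of the
    two features (normalized with a softmax here), and the weighted sum yields
    one first fused feature per target second feature.
    """
    d = second_features.shape[-1]
    scores = second_features @ second_features.T / d ** 0.5   # pairwise inner products
    weights = torch.softmax(scores, dim=-1)                   # first target weights
    return weights @ second_features                          # first fused features

second_features = torch.randn(16, 192)    # illustrative: 16 second features of length 192
first_fused = fuse_second_features(second_features)
```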
In an embodiment, the to-be-processed object is a to-be-processed image, the first vector indicates a first segmented image, the first segmented image includes some pixels in the to-be-processed image, the second vector indicates some pixels in the first segmented image, and the second fused feature is used to obtain a feature of the to-be-processed image.
In an embodiment, the electronic device may be the training device 220 described in
It should be noted that content such as information exchange and an execution process between modules in the electronic device shown in
An embodiment of this disclosure further provides an electronic device.
The electronic device 1400 may further include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Linux™, or FreeBSD™.
It should be noted that the central processing unit 1422 is further configured to perform other operations performed by the first machine learning model/rule 201 in
An embodiment of this disclosure further provides an electronic device.
The memory 1504 may include a read-only memory and a random access memory, and provide instructions and data for the processor 1503. A part of the memory 1504 may further include a non-volatile random access memory (NVRAM). The memory 1504 stores data and operation instructions, an executable module, or a data structure, or a subset or an extended set thereof. The operation instructions may include various operation instructions for implementing various operations.
The processor 1503 controls an operation of the electronic device. In an embodiment, the components of the electronic device are coupled together through a bus system. In addition to a data bus, the bus system may further include a power bus, a control bus, a status signal bus, and the like. However, for clear description, various types of buses in the figure are marked as the bus system.
The methods disclosed in the foregoing embodiments of this disclosure may be applied to the processor 1503, or may be implemented by the processor 1503. The processor 1503 may be an integrated circuit chip with a signal processing capability. In an embodiment, the operations in the foregoing methods may be implemented by using a hardware integrated logical circuit in the processor 1503, or by using instructions in a form of software. The processor 1503 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1503 may implement or perform the methods, operations, and logical block diagrams that are disclosed in embodiments of this disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. Operations of the methods disclosed with reference to embodiments of this disclosure may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware and software modules in a decoding processor. The software module may be located in a mature storage medium in the art, for example, a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1504. The processor 1503 reads information in the memory 1504 and completes the operations in the foregoing methods in combination with hardware in the processor 1503.
The receiver 1501 may be configured to receive input digit or character information, and generate signal inputs related to setting and function control of the execution device. The transmitter 1502 may be configured to output digit or character information through an interface. The transmitter 1502 may be further configured to send an instruction to a disk group through the interface, to modify data in the disk group. The transmitter 1502 may further include a display device such as a display screen.
In one case, in an embodiment of the disclosure, the application processor 15031 is configured to perform the method performed by the first machine learning model/rule 201 described in the embodiments corresponding to
For an implementation in which the application processor 15031 performs the functions of the first machine learning model/rule 201 in the embodiments corresponding to
It should be understood that the foregoing is merely an example provided in an embodiment of the disclosure. In addition, the vehicle may have more or fewer components than those shown, may combine two or more components, or may have different component configurations.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product.
The execution device and the training device in embodiments of this disclosure may be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor. The communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip performs the feature extraction method described in the embodiments shown in
In an embodiment,
In an embodiment, the operation circuit 1603 internally includes a plurality of processing units (PE). In an embodiment, the operation circuit 1603 is a two-dimensional systolic array. The operation circuit 1603 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In an embodiment, the operation circuit 1603 is a general-purpose matrix processor.
For example, it is assumed that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches, from the weight memory 1602, data corresponding to the matrix B, and buffers the data on each PE in the operation circuit. The operation circuit fetches data of the matrix A from an input memory 1601, performs a matrix operation on the data with the matrix B, and stores an obtained partial result or final result of the matrix in an accumulator 1608.
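As a rough software analogy only (not a description of the actual circuit), the following sketch mimics buffering the weight matrix B and accumulating partial results of the product with A; all sizes are illustrative.

```python
import numpy as np

A = np.random.randn(8, 16)    # input matrix from the input memory (illustrative sizes)
B = np.random.randn(16, 4)    # weight matrix buffered from the weight memory

# Partial results accumulate, mirroring the accumulator in the description above.
accumulator = np.zeros((8, 4))
for k in range(A.shape[1]):
    accumulator += np.outer(A[:, k], B[k, :])   # one partial result per step

C = accumulator                                  # final result equals A @ B
assert np.allclose(C, A @ B)
```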
A unified memory 1606 is configured to store input data and output data. Weight data is directly transferred to the weight memory 1602 by using a direct memory access controller (DMAC) 1605. The input data is also transferred to the unified memory 1606 by using the DMAC.
A bus interface unit 1610 (BIU for short) is used by an instruction fetch buffer 1609 to obtain an instruction from an external memory, and further used by the direct memory access controller 1605 to obtain original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly configured to transfer the input data in the external memory DDR to the unified memory 1606, transfer the weight data to the weight memory 1602, or transfer the input data to the input memory 1601.
A vector calculation unit 1607 includes a plurality of arithmetic processing units. When necessary, the vector calculation unit 1607 performs further processing on an output of the operation circuit, for example, vector multiplication, vector addition, an exponential operation, a logarithmic operation, and value comparison. The vector calculation unit 1607 is mainly configured to perform network calculation, such as batch normalization, pixel-level summation, and upsampling on a feature map, at a non-convolutional or non-fully-connected layer in a neural network.
In an embodiment, the vector calculation unit 1607 can store a processed output vector in the unified memory 1606. For example, the vector calculation unit 1607 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1603, for example, perform linear interpolation on a feature map extracted by a convolutional layer, or for another example, accumulate value vectors to generate an activation value. In an embodiment, the vector calculation unit 1607 generates a normalized value, a pixel-level summation value, or both. In an embodiment, the processed output vector can be used as an activation input to the operation circuit 1603, for example, for use at a subsequent layer in the neural network.
The instruction fetch buffer 1609 connected to the controller 1604 is configured to store an instruction used by the controller 1604. All of the unified memory 1606, the input memory 1601, the weight memory 1602, and the instruction fetch buffer 1609 are on-chip memories. The external memory is private for a hardware architecture of the NPU.
An operation at each layer in a recurrent neural network may be performed by the operation circuit 1603 or the vector calculation unit 1607.
Any one of the foregoing processors may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control execution of a program of the method of the first aspect.
An embodiment of this disclosure further provides a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor. The communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute a computer-executable instruction stored in the storage unit, so that the chip performs the method described in
An embodiment of this disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a program used for training a model. When the program runs on a computer, the computer is enabled to perform the methods described in
An embodiment of this disclosure further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform the operations in the methods described in the embodiments shown in
An embodiment of this disclosure further provides a circuit system. The circuit system includes a processing circuit, and the processing circuit is configured to perform operations in the methods described in the embodiments shown in
Based on the description of the foregoing implementations, one of ordinary skill in the art may clearly understand that this disclosure may be implemented by software only, by software in addition to necessary universal hardware, or certainly by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Usually, any function implemented by a computer program may be easily implemented by using corresponding hardware. In addition, hardware structures used to implement a same function may be various, for example, an analog circuit, a digital circuit, or a dedicated circuit. However, in this disclosure, a software program implementation is a better implementation in most cases. Based on such an understanding, the technical solutions of this disclosure essentially, or the part contributing to the prior art, may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a training device, or a network device) to perform the methods in embodiments of this disclosure. In addition, the computer software product may also be embodied in the form of controls, drivers, independent or downloadable software objects, or the like.
In the specification, claims, and the accompanying drawings of this disclosure, the terms "first", "second", and the like are intended to distinguish similar objects but do not necessarily indicate an order or sequence. It should be understood that the data used in such a way are interchangeable in proper circumstances so that the embodiments of the present disclosure described herein can be implemented in orders other than the order illustrated or described herein. The term "and/or" in this disclosure describes only an association relationship between associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this specification generally indicates an "or" relationship between the associated objects. In addition, the terms "include", "have", and any other variant thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that includes a list of operations or modules is not necessarily limited to those operations or modules that are expressly listed, but may include other operations or modules that are not expressly listed or are inherent to the process, method, system, product, or device. Naming or numbering of operations in this disclosure does not mean that the operations in a method process need to be performed in the time or logical sequence indicated by the naming or numbering. An execution sequence of the named or numbered process operations may be changed according to a technical objective to be achieved, provided that the same or similar technical effects can be achieved. Division into the modules in this disclosure is logical division. In actual application, there may be another division manner. For example, a plurality of modules may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be implemented through some ports, and the indirect coupling or communication connection between modules may be in an electrical form or another similar form. This is not limited in this disclosure. In addition, modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may or may not be grouped into multiple circuit modules. The objectives of the solutions of this disclosure may be achieved by selecting some or all of the modules according to actual requirements.
This disclosure is a continuation of International Application No. PCT/CN2022/077807, filed on Feb. 25, 2022, which claims priority to Chinese Patent Application No. 202110223032.8, filed on Feb. 26, 2021. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.