This application relates to the field of computer technologies, and in particular, to a sample data annotation system and method, and a related device.
In recent years, artificial intelligence (AI) has been applied in more and more fields. Most existing AI technologies, including supervised learning and deep learning (DL), rely on large annotated data sets. Annotating refers to generating and/or adding, for sample data, a label that can represent a classification of the sample data. A common existing annotation method is to annotate sample data centrally at a central node: each edge client node (referred to as an edge node) uploads a sample feature (including sample data or a key feature of the sample data) that needs to be annotated to the central node, and the central node annotates the sample feature. Another annotation method is to annotate the sample data at the edge node, that is, after collecting the sample feature that needs to be annotated, each edge node performs the annotation operation locally. Both annotation manners have defects. The former is limited by privacy protection requirements. For example, in the federated learning (FL) field, sample data of an edge node (for example, a personal photo stored on a personal mobile phone terminal or a patient profile of a hospital) cannot be uploaded to a central node for annotation. In addition, uploading all sample features of the edge nodes to the central node requires a large amount of data communication and causes an excessive communication load, and because all annotations need to be completed at the central node, a bottleneck easily forms there. The latter is limited by the low computing power of the edge node, so annotation efficiency at the edge node is not high. In addition, the annotation method of an edge node is usually simpler than that of central annotation, which may cause a specific sample feature to fail to be annotated or to be incorrectly annotated.
Therefore, how to improve annotation efficiency and annotation quality of sample data becomes a research problem in the AI field.
This application provides a sample data annotation system and method, and a related device, so as to improve annotation efficiency and annotation quality of AI sample data.
According to a first aspect, the present invention provides a sample data annotation system, including an edge node and a central node, where the central node is connected to the edge node. The edge node is configured to: obtain a key feature of sample data; determine, based on the key feature, whether the sample data is unknown sample data; when the sample data is unknown sample data, perform annotation processing on the sample data to obtain a first annotation result; and upload the first annotation result to the central node. The central node is configured to: receive the first annotation result sent by the edge node; perform consistency processing on a first annotation result indicating that annotation for an unknown sample succeeds, to obtain a second annotation result; and perform annotation processing on a first annotation result indicating that annotation for an unknown sample fails, to obtain a third annotation result.
In this application, the edge node annotates the sample data, thereby improving annotation efficiency. In addition, because of an objective condition such as relatively low computing power of the edge node, a first annotation result obtained after annotating by the edge node may have an annotation error or an annotation failure. After the first annotation result is sent to the central node, the central node may correct or annotate the first annotation result again, thereby improving annotation quality.
In an optional implementation, the edge node determines, based on an unknown sample model and the key feature obtained from the sample data, whether the sample data is unknown sample data.
In an optional implementation, the unknown sample model is generated based on a plurality of key features obtained from a known annotation result set. The known annotation result set includes an annotation result obtained by the central node before the unknown sample model is generated. These annotation results include the second annotation result generated after the central node performs consistency processing and/or the third annotation result obtained after the central node performs annotation.
In an optional implementation, the unknown sample model may further include a multidimensional coordinate space, and the multidimensional coordinate space is generated by the central node based on the key feature of the known annotation result set. A coordinate distance is obtained based on the key feature of the sample data or a mapping feature of the key feature, and whether the sample data is unknown sample data is determined based on a correlation indicated by the coordinate distance. If a larger coordinate distance indicates a lower correlation, the sample data is determined as unknown sample data when the coordinate distance is greater than or equal to a preset threshold. If a smaller coordinate distance indicates a lower correlation, the sample data is determined as unknown sample data when the coordinate distance is less than or equal to the preset threshold.
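The distance-based check described above can be illustrated with a minimal sketch. It assumes the case in which a larger coordinate distance indicates a lower correlation, uses Euclidean distance to the nearest known point as the coordinate distance, and treats an empty known set as making every sample unknown. The names `euclidean`, `nearest_distance`, and `is_unknown`, and the sample values, are hypothetical and are not taken from this application.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_distance(point, known_points):
    """Smallest distance from `point` to any point of the known set."""
    return min(euclidean(point, k) for k in known_points)

def is_unknown(point, known_points, threshold):
    # Assumption: larger distance means lower correlation, so a sample whose
    # nearest known point is at least `threshold` away is treated as unknown.
    if not known_points:
        return True  # nothing is known yet (assumption for the first run)
    return nearest_distance(point, known_points) >= threshold

known = [(0.0, 0.0), (1.0, 1.0)]
near_sample_unknown = is_unknown((0.1, 0.1), known, threshold=0.5)
far_sample_unknown = is_unknown((5.0, 5.0), known, threshold=0.5)
```

In this sketch the first sample lies close to a known point and is treated as known, while the second is far from every known point and is treated as unknown. Other distance measures could be substituted without changing the structure.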
In an optional implementation, the unknown sample model may further include a neural network model. An inference action is performed on the sample data by using the neural network model. If a correct inference result cannot be obtained, the sample data is considered to be unknown sample data.
In an optional implementation, when the first annotation result indicates that the unknown sample data is successfully annotated, the first annotation result includes a sample identifier of the sample data, a sample feature of the sample data, and a label determined by the edge node for the sample data; or when the first annotation result indicates that the unknown sample data fails to be annotated, the first annotation result includes a sample identifier of the sample data and a sample feature of the sample data. In an optional implementation, a sample feature of the first annotation result includes the sample data and/or the key feature.
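As an illustration of the two result shapes described above, the following sketch models a first annotation result as a small record whose label is absent when annotation fails. The class name `AnnotationResult`, its field names, and the example values are assumptions made for illustration; this application does not prescribe a concrete data layout here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class AnnotationResult:
    sample_id: str                    # sample identifier of the sample data
    sample_feature: Sequence[float]   # sample data and/or key feature
    label: Optional[str] = None       # present only when annotation succeeded

    @property
    def succeeded(self) -> bool:
        """True when the edge node determined a label for the sample."""
        return self.label is not None

# Hypothetical examples: one successful and one failed annotation.
ok = AnnotationResult("s-1", (0.2, 0.4), label="video")
failed = AnnotationResult("s-2", (9.0, 9.5))
```

Carrying the label as an optional field lets a receiver distinguish the two cases without a separate status flag, which is one possible design among several.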
In an optional implementation, when consistency processing is performed on the first annotation result, the central node performs clustering on a plurality of first annotation results including the first annotation result through similarity division, to obtain a group corresponding to the first annotation result. Further, in an optional implementation, the central node performs clustering, in an unsupervised manner, on the plurality of first annotation results including the first annotation result, and the unsupervised manner includes one or more of the following: K-MEANS and KNN.
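The clustering step can be sketched with a plain K-means loop over the sample features of the first annotation results. This is a minimal illustration, assuming Euclidean distance and a deterministic initialisation that takes the first k points as centroids; the function names and the example features are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=20):
    """Plain K-means; returns one cluster index per input point."""
    # Assumption for the sketch: the first k points seed the centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment

# Hypothetical sample features taken from four first annotation results.
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
groups = kmeans(features, k=2)
```

Results that receive the same cluster index form one group for the subsequent label check. A production system would more likely rely on an existing clustering library than on a hand-written loop.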
In an optional implementation, when labels of annotation results in the group are inconsistent, the central node performs an ensemble decision on labels of all annotation results in the group to obtain a group label, and obtains the second annotation result based on the group label. In an optional implementation, the ensemble decision includes a voting method or a weighted voting method.
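The voting and weighted voting decisions mentioned above can be sketched as follows. The function name `group_label` and the weights are hypothetical; how weights would be assigned to individual annotation results (for example, per edge node) is not specified here and is left as an assumption.

```python
from collections import Counter

def group_label(labels, weights=None):
    """Majority vote over labels; weighted vote when weights are given."""
    if not labels:
        raise ValueError("a group needs at least one label")
    if weights is None:
        weights = [1.0] * len(labels)  # plain voting: every result counts once
    tally = Counter()
    for label, weight in zip(labels, weights):
        tally[label] += weight
    # The label with the largest total weight becomes the group label.
    return tally.most_common(1)[0][0]

plain = group_label(["video", "voice", "video"])
weighted = group_label(["video", "voice", "video"], weights=[0.2, 0.9, 0.3])
```

With equal weights the majority label "video" wins. With the example weights, "voice" accumulates 0.9 against 0.5 and becomes the group label. Ties are not handled specially in this sketch.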
In an optional implementation, the central node generates a new unknown sample model or updates the existing unknown sample model based on the second annotation result and/or the third annotation result.
According to a second aspect, this application discloses a sample data annotation method, applied to an edge node of a sample annotation system. The method includes: obtaining a key feature of sample data; determining, based on the key feature, whether the sample data is unknown sample data; when the sample data is unknown sample data, performing annotation processing on the unknown sample data to obtain a first annotation result; and sending the first annotation result to a central node.
In this method, the edge node only annotates sample data that needs to be annotated, which can reduce a quantity of sample data that needs to be annotated, and sends an annotated first annotation result to the central node for further processing, thereby improving annotation quality.
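The edge-side flow of the second aspect can be summarised in a short sketch. The unknown-sample check and the local annotator are passed in as callables because their concrete form is implementation-specific. The function name `edge_process`, the dictionary layout, and the stand-in callables are assumptions for illustration only.

```python
def edge_process(sample_id, key_feature, is_unknown, local_annotate):
    """Return a first annotation result, or None for an already known sample."""
    if not is_unknown(key_feature):
        return None  # known sample: nothing to annotate or send
    label = local_annotate(key_feature)  # returns None when annotation fails
    result = {"sample_id": sample_id, "sample_feature": key_feature}
    if label is not None:
        result["label"] = label  # only successful results carry a label
    return result

# Hypothetical stand-ins for the unknown sample model and the local annotator.
unknown_if_large = lambda feature: feature[0] > 10
annotate_very_large = lambda feature: "bulk" if feature[0] > 100 else None

known_case = edge_process("a", (5,), unknown_if_large, annotate_very_large)
success_case = edge_process("b", (500,), unknown_if_large, annotate_very_large)
failure_case = edge_process("c", (50,), unknown_if_large, annotate_very_large)
```

Only the second and third cases would be sent to the central node. The third carries no label, which signals that local annotation failed.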
In an optional implementation, the edge node determines, based on an unknown sample model and the key feature, whether the sample data is unknown sample data.
In an optional implementation, the unknown sample model is generated based on a plurality of key features obtained from a known annotation result set. The known annotation result set includes an annotation result that is successfully annotated and obtained by the central node before the unknown sample model is generated.
In an optional implementation, the unknown sample model may further include a multidimensional coordinate space, and the multidimensional coordinate space is generated by the central node of the sample data annotation system based on the key feature of the known annotation result set. A coordinate distance is obtained based on the key feature of the sample data or a mapping feature of the key feature, and whether the sample data is unknown sample data is determined based on a correlation indicated by the coordinate distance. If a larger coordinate distance indicates a lower correlation, the sample data is determined as unknown sample data when the coordinate distance is greater than or equal to a preset threshold. If a smaller coordinate distance indicates a lower correlation, the sample data is determined as unknown sample data when the coordinate distance is less than or equal to the preset threshold.
In an optional implementation, the unknown sample model may further include a neural network model. An inference action is performed on the sample data by using the neural network model. If a correct inference result cannot be obtained, it is considered that the sample data is unknown sample data.
In an optional implementation, when the first annotation result indicates that the unknown sample data is successfully annotated, the first annotation result includes a sample identifier of the sample data, a sample feature of the sample data, and a label determined by the edge node for the sample data; or when the first annotation result indicates that the unknown sample data fails to be annotated, the first annotation result includes a sample identifier of the sample data and a sample feature of the sample data. In an optional implementation, a sample feature of the first annotation result includes the sample data and/or the key feature.
According to a third aspect, this application discloses a sample data annotation method, and the method is applied to a central node of a sample annotation system. The method includes: receiving a first annotation result sent by an edge node in the sample annotation system, where the first annotation result is obtained by the edge node by performing annotation processing on unknown sample data; and when the first annotation result indicates that the unknown sample data is successfully annotated, performing consistency processing on the first annotation result to obtain a second annotation result; or when the first annotation result indicates that the unknown sample data fails to be annotated, performing annotation processing on the unknown sample data to obtain a third annotation result.
In this method, the central node performs secondary processing on the first annotation result sent by the edge node, thereby improving annotation quality of sample data.
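The routing performed by the central node in the third aspect can be sketched as follows. Consistency processing and central annotation are passed in as callables, and the function name `central_process`, the dictionary layout, and the stand-ins are hypothetical.

```python
def central_process(first_results, consistency_process, annotate):
    """Route first annotation results by outcome."""
    succeeded = [r for r in first_results if r.get("label") is not None]
    failed = [r for r in first_results if r.get("label") is None]
    # Successful results go through consistency processing (second results).
    second_results = consistency_process(succeeded)
    # Failed results are annotated by the central node (third results).
    third_results = [dict(r, label=annotate(r["sample_feature"])) for r in failed]
    return second_results, third_results

# Hypothetical stand-ins for the two central-node operations.
keep_as_is = lambda results: list(results)
label_by_size = lambda feature: "large" if feature[0] > 100 else "small"

received = [
    {"sample_id": "a", "sample_feature": (500,), "label": "large"},
    {"sample_id": "b", "sample_feature": (20,)},
]
second, third = central_process(received, keep_as_is, label_by_size)
```

In this sketch the absence of a label is what marks a first annotation result as failed. An explicit status field would serve equally well.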
In an optional implementation, when consistency processing is performed on the first annotation result, the central node performs clustering on a plurality of first annotation results including the first annotation result through similarity division, to obtain a group corresponding to the first annotation result. Further, in an optional implementation, the central node performs clustering, in an unsupervised manner, on the plurality of first annotation results including the first annotation result, and the unsupervised manner includes one or more of the following: K-MEANS and KNN.
In an optional implementation, when labels of annotation results in the group are inconsistent, the central node performs an ensemble decision on labels of all annotation results in the group to obtain a group label, and obtains the second annotation result based on the group label. In an optional implementation, the ensemble decision includes a voting method or a weighted voting method.
In an optional implementation, the central node generates a new unknown sample model or updates an existing unknown sample model based on the second annotation result and/or the third annotation result.
According to a fourth aspect, this application discloses a node, and the node includes a function module for performing the sample annotation method provided in the second aspect or any possible design of the second aspect. In this application, division of the function module is not limited, and the function module may be correspondingly divided according to a procedure step of the sample annotation method in the second aspect, or may be divided according to a specific implementation requirement.
According to a fifth aspect, this application discloses a node, and the node includes a function module for performing the sample annotation method provided in the third aspect or any possible design of the third aspect. In this application, division of the function module is not limited, and the function module may be correspondingly divided according to a procedure step of the sample annotation method in the third aspect, or may be divided according to a specific implementation requirement.
Implementations of the different aspects of this application may be mutually combined or referenced when there is no conflict.
According to a sixth aspect, this application discloses computer program code. When instructions included in the program code are executed by a computer, the computer can implement the sample data annotation method in the second aspect or any one of the possible implementations of the second aspect.
According to a seventh aspect, this application discloses computer program code. When instructions included in the program code are executed by a computer, the computer can implement the sample data annotation method in the third aspect or any one of the possible implementations of the third aspect.
According to an eighth aspect, this application discloses a computer-readable storage medium, where the computer-readable storage medium stores computer program instructions. When the computer program instructions run on a computer, the computer performs the sample data annotation method in the second aspect or any one of the possible implementations of the second aspect.
According to a ninth aspect, this application discloses a computer-readable storage medium, where the computer-readable storage medium stores computer program instructions. When the computer program instructions run on a computer, the computer performs the sample data annotation method in the third aspect or any one of the possible implementations of the third aspect.
To describe the technical solutions in embodiments of the present invention more clearly, the following briefly describes the accompanying drawings for describing embodiments. It is clear that the accompanying drawings in the following descriptions show merely some embodiments of the present invention.
To make a person skilled in the art understand the technical solutions in the present invention better, the following clearly describes the technical solutions in embodiments of the present invention with reference to the accompanying drawings in embodiments of the present invention. It is clear that the described embodiments are a part rather than all of embodiments of the present invention.
In embodiments of this application, the word “example”, “for example”, or the like is used to represent giving an example, an illustration, or a description. Any embodiment or design scheme described as an “example” or “for example” in embodiments of this application should not be explained as being more preferred or having more advantages than another embodiment or design scheme. Rather, use of the word “example”, “for example”, or the like is intended to present a related concept in a specific manner. In the descriptions of embodiments of this application, unless otherwise stated, “a plurality of” means two or more. For example, a plurality of nodes refers to two or more nodes. “At least one” means any quantity, such as one, two, or more. “A and/or B” may be only A, only B, or both A and B. “At least one of A, B, and C” may be only A, only B, only C, A and B, B and C, A and C, or A, B, and C. In this application, terms such as “first” and “second” are used only to distinguish between different objects, and are not used to indicate priorities or importance of the objects.
To improve annotation efficiency and quality of sample data, in an implementation, as shown in
The central node 110 or the edge node 120 may be deployed on at least one processor platform 200, as shown in
The sample data annotation system 100 may start one or more sample data annotation tasks, and each sample data annotation task may be started periodically (for example, a set time or a set period) or based on an external trigger condition (for example, original sample data reaches a threshold) or based on an instruction. The following describes, according to
Steps S311-S314 disclose a sample data annotation method applied to the edge node 310. In step S311, the edge node 310 obtains a key feature of sample data. To obtain the key feature, the edge node 310 may extract the key feature from the sample data after obtaining the sample data, or may receive a key feature of sample data sent by another device. Step S311 may be triggered when the edge node 310 receives the sample data, or may be periodically triggered. The key feature is a related attribute of the sample data, such as one or more of a total length of the sample data, a packet average interval of the sample data, an uplink-downlink direction, a collection starting time of the sample data, and an end period.
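For illustration, the following sketch derives the kinds of key features listed above from a sequence of packets. The `Packet` record, the dictionary keys, and the choice to express direction as a count of uplink packets are assumptions; the actual feature set and encoding are implementation-specific.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float  # collection time in seconds (assumed unit)
    length: int       # packet length in bytes (assumed unit)
    uplink: bool      # True for uplink, False for downlink

def key_feature(packets):
    """Extract simple key features from a non-empty list of packets."""
    if not packets:
        raise ValueError("sample data must contain at least one packet")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    return {
        "total_length": sum(p.length for p in ordered),
        "avg_interval": sum(gaps) / len(gaps) if gaps else 0.0,
        "uplink_packets": sum(1 for p in ordered if p.uplink),
        "start_time": ordered[0].timestamp,
        "end_time": ordered[-1].timestamp,
    }

sample = [Packet(0.0, 100, True), Packet(0.5, 300, False), Packet(1.5, 200, True)]
feature = key_feature(sample)
```

For the three example packets the total length is 600, the average interval is 0.75, and two packets are uplink.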
In step S312, the edge node 310 determines, based on the obtained key feature, whether the sample data is unknown sample data. In an implementation, the edge node 310 maps the key feature obtained in step S311 to an unknown sample data model of the edge node to determine whether the sample data is unknown sample data. In an implementation, the unknown sample data model is generated by the central node 320 based on a plurality of key features obtained from a known annotation result set. The known annotation result set includes a plurality of annotation results that are successfully annotated and obtained by the central node 320 before the unknown sample data model is generated. Each annotation result includes a sample identifier of the sample data, a sample feature of the sample data, and a label of the sample data. The label may be annotated by the central node 320, or may be annotated by the edge node 310 and obtained by performing consistency processing by the central node. In another implementation, when the sample data annotation system 100 performs annotation for the first time, the central node 320 generates a known annotation result set based on an empty annotation result set or an externally loaded annotation result set. In an implementation, the unknown sample data model includes a multidimensional coordinate space, and the multidimensional coordinate space is generated by the central node 320 based on the key feature of the known annotation result set. As shown in
In the sample data annotation system 100, a correlation between sample data may be represented by using a coordinate distance. When the correlation is lower than a preset threshold, the sample data is determined as unknown sample data. The edge node 310 may preset one or more thresholds. If a larger coordinate distance indicates a lower correlation, a specific implementation of step S312 may be that, for example, when a coordinate distance, in the multidimensional coordinate space, of the key feature of the sample data obtained in step S311 or of the mapping feature of the key feature is greater than or equal to a preset threshold, it may be determined that the sample data is unknown sample data. If a smaller coordinate distance indicates a lower correlation, a specific implementation of step S312 may be that, for example, when a coordinate distance, in the multidimensional coordinate space, of the key feature of the sample data obtained in step S311 or of the mapping feature of the key feature is less than or equal to a preset threshold, it may be determined that the sample data is unknown sample data. As shown in
In an implementation, the unknown sample data model further includes a neural network model. An inference action may be performed on the key feature obtained in step S311 by using the neural network model. When the neural network model cannot obtain a correct inference result, the sample data is considered to be unknown sample data.
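One possible reading of "cannot obtain a correct inference result" is that the model's most confident class falls below a confidence threshold. The following sketch illustrates that reading only. The softmax-based confidence, the threshold value of 0.8, and the function names are assumptions and are not stated in this application.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw model outputs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_unknown_by_confidence(logits, min_confidence=0.8):
    # Assumption: a low top-class probability means no usable inference result.
    return max(softmax(logits)) < min_confidence

confident = is_unknown_by_confidence([5.0, 0.0, 0.0])   # one class dominates
uncertain = is_unknown_by_confidence([1.0, 1.0, 1.0])   # all classes tie
```

The first output vector yields a dominant class and is treated as known, while the second spreads probability evenly and is treated as unknown.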
When determining that the sample data is unknown sample data in step S312, in step S313, the edge node 310 annotates the sample data to obtain a first annotation result. In an implementation, a format of an annotation result in this application is shown in
Steps S321-S323 disclose a sample data annotation method applied to the central node 320. After receiving a first annotation result sent by each edge node (for example, 120A, 120B, 120C, and 120D in
In another implementation, the central node 320 may further perform batch processing on the received first annotation results, that is, divides the first annotation results into several batches. After steps S321-S323 are performed on first annotation results of the first batch, steps S321-S323 are performed again on first annotation results of the second batch, and so on, until all the received first annotation results are processed.
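The batch division described above can be sketched with a small generator. The batch size and the function name `batches` are hypothetical.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical identifiers of received first annotation results.
received_ids = ["r1", "r2", "r3", "r4", "r5"]
batch_list = list(batches(received_ids, size=2))
```

Each yielded batch would then pass through steps S321-S323 before the next batch is taken.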
When step S321 indicates that the first annotation result is successfully annotated, step S322 is performed. In step S322, after performing consistency processing on the first annotation result indicating successful annotation, the central node 320 obtains a second annotation result. A format of the second annotation result is shown in
When step S321 indicates that the first annotation result fails to be annotated, step S323 is performed. In step S323, the central node 320 annotates the first annotation result indicating failed annotation to obtain a third annotation result, and a format of the third annotation result is shown in
Further, in an implementation, after obtaining the second and third annotation results, the central node 320 may generate a new unknown sample model or update an original unknown sample model based on the second and third annotation results. The newly generated or updated unknown sample model is sent to the edge node 310. After receiving the unknown sample model, the edge node 310 updates a locally existing unknown sample model.
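The update-and-distribute cycle can be sketched as follows, assuming a model that is simply a set of known feature points plus a version counter. The `version` field, the dictionary layout, and the `EdgeNode` class are hypothetical and serve only to make the replacement step concrete.

```python
def update_model(model, new_results):
    """Return a new model that also covers newly labelled sample features."""
    labelled = [
        tuple(r["sample_feature"])
        for r in new_results
        if r.get("label") is not None
    ]
    return {
        "version": model["version"] + 1,  # hypothetical version counter
        "known_points": list(model["known_points"]) + labelled,
    }

class EdgeNode:
    def __init__(self, model):
        self.model = model

    def receive_model(self, model):
        # Replace the local model only when the received one is newer.
        if model["version"] > self.model["version"]:
            self.model = model

old_model = {"version": 1, "known_points": [(0.0, 0.0)]}
results = [{"sample_id": "b", "sample_feature": (20.0, 3.0), "label": "small"}]
new_model = update_model(old_model, results)

edge = EdgeNode(old_model)
edge.receive_model(new_model)
```

After the update, a sample near the newly added point would no longer be treated as unknown by a distance-based check against `known_points`.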
Further, in an implementation, the edge node 310 or the central node 320 may perform encryption processing on data that is sent, and perform decryption processing on data that is received, where the encrypted/decrypted data includes the first annotation result or the unknown sample model.
The foregoing describes, from a system perspective, the sample data annotation method provided in embodiments of this application. It may be understood that, to implement the foregoing functions, the edge node or the central node in embodiments of this application includes a hardware structure and/or a software module for correspondingly implementing each function. A person skilled in the art should easily recognize that functions and steps of examples described in the embodiments disclosed in this application can be implemented in a form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or hardware driven by computer software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions, but such implementations shall not be considered to be beyond the scope of this application.
The following describes a structure of a node in this application from different perspectives. To implement the method shown in
An embodiment of this application also provides a computer-readable storage medium, configured to store program code for implementing the foregoing sample data annotation method, and instructions included in the program code are used to perform the method procedure in any one of the foregoing method embodiments. The foregoing storage medium may include any non-transitory machine-readable medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random access memory (RAM), a solid-state drive (SSD), or a non-volatile memory.
It should be noted that embodiments provided in this application are merely examples. A person skilled in the art may clearly understand that, for convenience and conciseness of description, the foregoing embodiments emphasize different aspects, and for a part not described in detail in one embodiment, reference may be made to relevant descriptions of another embodiment. Features disclosed in embodiments, claims, and accompanying drawings of this application may exist independently or exist in a combination. In embodiments of this application, features described in a hardware form may be performed by software, and vice versa. This is not limited herein.
The foregoing descriptions are merely specific implementations of the present invention, but are not intended to limit the protection scope of the present invention. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
202010642572.5 | Jul 2020 | CN | national |
This application is a continuation of International Application No. PCT/CN2021/095786, filed on May 25, 2021, which claims priority to Chinese Patent Application No. 202010642572.5, filed on Jul. 6, 2020. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2021/095786 | May 2021 | US |
Child | 18150505 | | US |