Since texts such as corpus data can often reuse the contents of historical texts, it is inefficient to rewrite and reorganize the corpus every time a new text is produced. In addition, ready-made corpus data have generally been tested over a long time, so their stability and accuracy are high; if the text is rewritten from scratch, semantic omissions are difficult to avoid.
In general, corpus data in historical texts are arranged or organized according to rules, and semantic relationships exist among them. Taking these data as material and producing new texts according to the requirements of the new texts is therefore an approach worth considering.
Embodiments of the present application provide a data clustering method and system and a data storage method and system. The data storage method and system are used for decomposing historical clustering data into clustering atoms and storing the clustering atoms; furthermore, the data clustering method and system can produce new clustering data from the clustering atoms, so as to improve the efficiency of producing clustering data and reduce the probability of errors in the clustering data.
According to one aspect of this application, a data clustering method is provided, comprising: analyzing historical clustering data, decomposing it into clustering atoms based on the properties of each part of the historical clustering data, and associating the clustering atoms with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong; forming a clustering atom pool according to the properties of the clustering atoms, the clustering atom pool including an unstructured relationship of the clustering atoms; searching the clustering atom pool for clustering atoms to form alternative clustering atoms, wherein the search is based on a target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms; and forming the target clustering data by referencing the alternative clustering atoms.
In some embodiments of this application, optionally, the historical clustering data is the historical corpus clustering data, and the clustering atom is the corpus clustering atom.
In some embodiments of this application, optionally, the search is also based on corpus matching.
In some embodiments of this application, optionally, the clustering atoms are organized in the form of a graph database and stored in a clustering atom pool.
In some embodiments of this application, optionally, the search uses a graph search method.
In some embodiments of this application, optionally, the clustering atoms have hierarchies, wherein: when a superior clustering atom is taken as an alternative clustering atom, its inferior clustering atoms are also taken as alternative clustering atoms; and a superior clustering atom can be traced upward from an inferior clustering atom that is an alternative clustering atom, and that superior clustering atom is set as an alternative clustering atom.
In some embodiments of this application, optionally, the clustering attribute comprises object, kind, region, sex, age and period.
In some embodiments of this application, optionally, if the referenced alternative clustering atoms are not compatible with each other, a hint message is generated.
According to one aspect of this application, a data storage method is provided, comprising: analyzing historical clustering data, decomposing it into clustering atoms based on the properties of each part of the historical clustering data, and associating the clustering atoms with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong; and forming a clustering atom pool according to the properties of the clustering atoms, the clustering atom pool including an unstructured relationship of the clustering atoms.
In some embodiments of this application, optionally, the historical clustering data is the historical corpus clustering data, and the clustering atom is the corpus clustering atom.
In some embodiments of this application, optionally, the clustering atoms are organized in the form of a graph database and stored in a clustering atom pool.
In some embodiments of this application, optionally, the clustering attribute comprises object, kind, region, sex, age and period.
According to another aspect of the application, a computer-readable storage medium is provided in which an instruction is stored, wherein the instruction, when executed by a processor, causes the processor to perform any of the methods described above.
According to another aspect of this application, a data clustering system is provided, comprising: an analyzing unit, which is configured to analyze historical clustering data, decompose it into clustering atoms based on the properties of each part of the historical clustering data, and associate the clustering atoms with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong; a pooling unit, which is configured to form a clustering atom pool according to the properties of the clustering atoms, the pool comprising an unstructured relationship of the clustering atoms; a search unit, which is configured to search the pooling unit for clustering atoms to form alternative clustering atoms, wherein the search is based on a target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms; and an assembly unit, which is configured to form the target clustering data by referencing the alternative clustering atoms.
According to another aspect of this application, a data storage system is provided, comprising: an analyzing unit, which is configured to analyze historical clustering data, decompose it into clustering atoms based on the properties of each part of the historical clustering data, and associate the clustering atoms with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong; and a storage unit, which is configured to form a clustering atom pool according to the properties of the clustering atoms, the pool comprising an unstructured relationship of the clustering atoms.
The above and other purposes and advantages of this application are further clarified in the following detailed description in conjunction with the accompanying drawings, in which the same or similar elements are indicated by the same labels.
For the purposes of brevity and illustration, this document describes the principles of this application mainly by reference to its exemplary embodiments. However, a person skilled in the art will readily recognize that the same or similar principles can be applied equally to all types of data clustering methods and systems, data storage methods and systems, and storage media, and that any such changes do not depart from the true spirit and scope of this application.
According to one aspect of the application, a data clustering method is provided. As shown in
The historical clustering data and the target clustering data in this application are the same kind of data in terms of how they are used. For example, both can be advertising text, legal text, agreement text, or other application data whose clustering atoms can be reorganized; they can also be program code or other application data whose clustering atoms can be reorganized; or they can be original products used to construct contracts such as insurance financing contracts (a final contract may be formed from the products).
Both the historical clustering data and the target clustering data in this application include clustering atoms. In this context, a clustering atom can be the smallest unit of the historical clustering data and the target clustering data that cannot be subdivided further, in the sense that any further subdivision would have no clustering significance; a clustering atom can also be a set of several such smallest constituent units. Each clustering atom has its own properties, and these clustering atoms constitute the historical clustering data. For example, the text of an agreement can include terms, subject matter, liability, and so on, where the “TERMS” section, “SUBJECT MATTER” section, and “LIABILITY” section could serve as clustering atoms, and the properties of these clustering atoms can be terms, subject matter, and liability. As another example, for program code, a clustering atom can be a function that implements a particular functionality, and that functionality constitutes the property of the function.
In step S201, the data clustering method 20 of the present application analyzes the historical clustering data and decomposes it into clustering atoms based on the properties of each part of the historical clustering data. As shown in
As shown in
The clustering atoms decomposed from the historical clustering data are associated with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong. Because the clustering atoms are decomposed from the historical clustering data to which they belong, they inherit or are related to at least part of the attributes of that data. Assigning attributes to the clustering atoms makes them convenient to associate and reorganize.
As shown in
In some embodiments of this application, for general semantic text, the clustering attributes may include language type, literary style, and so on. For general contracts, the clustering attributes can include object (subject matter), category, region, sex, age, (effective) period, and so on. For original products that are used to construct contracts such as insurance financing contracts, the clustering attributes can also include type of insurance, time of sale, and so on. For program code, the clustering attribute can be the problem solved or the function implemented by the program code, for example, a crawler function, an API called by a mailbox, and so on. These clustering attributes reflect the role the historical clustering data played in solving historical technical problems; the decomposed clustering atoms can inherit or be associated with these clustering attributes and be reused to solve subsequent technical problems. The clustering attributes inherited by or associated with the clustering atoms can serve as a basis for selecting clustering atoms, thus avoiding the low efficiency of blind selection.
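The decomposition of step S201 and the inheritance of clustering attributes can be illustrated with a minimal sketch. It is not the implementation of this application: the class name `ClusteringAtom`, the function `decompose`, the field names, and the sample contract are all hypothetical, and the sketch assumes the historical document has already been split into sections keyed by property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringAtom:
    """One decomposed part of a historical document (hypothetical structure)."""
    atom_id: str
    prop: str               # the atom's own property, e.g. "LIABILITY"
    text: str
    attributes: tuple = ()  # (name, value) pairs inherited from the source data


def decompose(doc_id, sections, doc_attributes):
    """Turn each section of a historical document into one clustering atom.

    `sections` maps a property name to the section text (assumed input shape);
    `doc_attributes` holds the clustering attributes of the whole document.
    Every atom inherits all of the document's attributes.
    """
    inherited = tuple(sorted(doc_attributes.items()))
    return [
        ClusteringAtom(f"{doc_id}:{prop}", prop, text, inherited)
        for prop, text in sections.items()
    ]


# Hypothetical historical contract with two sections and two attributes.
atoms = decompose(
    "contract-A",
    {"TERMS": "This agreement runs for one year.",
     "LIABILITY": "Each party bears its own losses."},
    {"region": "CN", "period": "1y"},
)
for atom in atoms:
    print(atom.atom_id, atom.prop, dict(atom.attributes))
```

Each atom keeps its own property separately from the attributes it inherits, so a later search can filter on either one.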
In step S202, the data clustering method 20 of the present application forms, according to the properties of the clustering atoms, a clustering atom pool that includes the unstructured relationships of the clustering atoms. In this embodiment, the clustering atoms are pooled so that they are organized efficiently, which in turn makes it convenient to call clustering atoms that are associated with one another. As shown in
Refer to
In step S203, the data clustering method 20 of this application searches the clustering atom pool for clustering atoms to form alternative clustering atoms. The search is based on the target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms. In some embodiments of this application, the search uses a graph search method. For example, to construct target clustering data 105 as shown in
In step S204, the data clustering method 20 of this application forms the target clustering data by referencing the alternative clustering atoms. The search in step S203 may return many alternative choices. The target clustering data 105 can then be constructed by selecting, according to the requirements, appropriate options from these alternative clustering atoms. As shown in
In some embodiments of the present application, the historical clustering data is historical corpus clustering data, and the clustering atoms are corpus clustering atoms. For example, the historical clustering data can be application data whose clustering atoms can be reorganized, such as agreement text; the clustering atoms are then the chapters of the agreement text (also known as “Paragraphs”), and these chapters can be assembled into other agreement texts. A chapter has the same “Properties” in the original agreement text as it does in the assembled agreement text (such as the “TERMS” section, “SUBJECT MATTER” section, and “LIABILITY” section).
In some embodiments of this application, the search is also based on corpus matching. As described above, the search is based on the target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms. In other embodiments, corpus matching can further restrict the search results so that the alternative clustering atoms meet the search requirements more closely in semantic terms. Corpus matching can include keyword matching, synonym matching, and so on.
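A search that first filters on property and clustering attributes and then optionally narrows the result by corpus matching can be sketched as follows. This is only an illustration under stated assumptions: the pool contents, the atom identifiers, the `SYNONYMS` table, and the function `search` are hypothetical, and substring comparison stands in for whatever keyword and synonym matching a real system would use.

```python
# Hypothetical synonym table; a real system would load a proper lexicon.
SYNONYMS = {"liability": {"liability", "responsibility"}}

# Hypothetical pool: each atom has a property, inherited attributes, and text.
POOL = [
    {"id": "a1", "prop": "LIABILITY", "attrs": {"region": "CN"},
     "text": "Each party bears responsibility for its own losses."},
    {"id": "a2", "prop": "LIABILITY", "attrs": {"region": "US"},
     "text": "Liability is capped at the fees paid."},
    {"id": "a3", "prop": "TERMS", "attrs": {"region": "CN"},
     "text": "The agreement runs for one year."},
]


def search(pool, prop, target_attrs, keyword=None):
    """Return alternative atoms for a target document.

    An atom qualifies if its property equals `prop` and it carries every
    attribute in `target_attrs`. If `keyword` is given, the result is further
    restricted to atoms whose text contains the keyword or one of its synonyms.
    """
    hits = [
        atom for atom in pool
        if atom["prop"] == prop
        and all(atom["attrs"].get(k) == v for k, v in target_attrs.items())
    ]
    if keyword is None:
        return hits
    terms = SYNONYMS.get(keyword.lower(), {keyword.lower()})
    return [a for a in hits if any(t in a["text"].lower() for t in terms)]


by_attrs = [a["id"] for a in search(POOL, "LIABILITY", {"region": "CN"})]
by_keyword = [a["id"] for a in search(POOL, "LIABILITY", {}, keyword="fees")]
print(by_attrs, by_keyword)
```

Applying the attribute filter before the text match keeps the corpus matching step as a refinement of the attribute-based search rather than a replacement for it.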
In some embodiments of this application, there are hierarchical relationships among the clustering atoms, wherein: when a superior clustering atom is taken as an alternative clustering atom, its inferior clustering atoms are also taken as alternative clustering atoms; and a superior clustering atom can be traced upward from an inferior clustering atom that is an alternative clustering atom, and that superior clustering atom is set as an alternative clustering atom. Further refer to
In some embodiments of this application, if the referenced alternative clustering atoms are not compatible with each other, a hint message is generated. In some embodiments, two or more alternative clustering atoms should not be referenced at the same time, and a hint message can be generated if there is a reference conflict. For example, if the clustering atom 1012 and the clustering atom 1022 have the same properties and both meet the search conditions, both will be selected as alternative clustering atoms. Because the target clustering data 105 needs only one paragraph that meets the specific properties, the clustering atom 1012 and the clustering atom 1022 cannot be referenced at the same time. In some embodiments, if the user initiates a reference to the clustering atom 1012 and the clustering atom 1022 at the same time, the system can alert the user to the conflict in the reference by returning a hint message. The above shows only one specific case of “incompatibility” and does not limit the scope of protection of the invention.
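The reference conflict in the example above, where two referenced atoms share a property that the target needs only once, can be detected with a short check. The sketch below is illustrative only: the function name `check_references`, the wording of the hint message, the property label assigned to atoms 1012 and 1022, and the extra atom `x1` are assumptions, and duplicate properties are just one possible kind of incompatibility.

```python
from collections import defaultdict


def check_references(referenced):
    """Return hint messages for referenced atoms that share a property.

    `referenced` is a list of (atom_id, property) pairs chosen by the user.
    The rule assumed here is that the target needs at most one atom per
    property, so two atoms with the same property conflict.
    """
    by_prop = defaultdict(list)
    for atom_id, prop in referenced:
        by_prop[prop].append(atom_id)
    return [
        f"Conflict: atoms {', '.join(ids)} share property '{prop}'; "
        "reference only one of them."
        for prop, ids in by_prop.items()
        if len(ids) > 1
    ]


# Atoms 1012 and 1022 have the same property (the label is an assumption);
# "x1" is a hypothetical third atom with a different property.
hints = check_references([
    ("1012", "LIABILITY"),
    ("1022", "LIABILITY"),
    ("x1", "TERMS"),
])
for hint in hints:
    print(hint)
```

Returning the hints instead of raising an error leaves the decision with the user, which matches the alerting behavior described above.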
According to one aspect of this application, a data storage method is provided. As shown in
In step S301, the historical clustering data is analyzed and decomposed into clustering atoms based on the properties of each part of the historical clustering data. As shown in
The clustering atoms decomposed from the historical clustering data are associated with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong. Because the clustering atoms are decomposed from the historical clustering data to which they belong, they inherit or are related to at least part of the attributes of that data. Assigning attributes to the clustering atoms makes them convenient to associate and reorganize.
As shown in
In some embodiments of this application, for general semantic text, the clustering attributes may include language type, literary style, and so on. For general contracts, the clustering attributes can include object (subject matter), category, region, sex, age, (effective) period, and so on. For original products that are used to construct contracts such as insurance financing contracts, the clustering attributes can also include type of insurance, time of sale, and so on. For program code, the clustering attribute can be the problem solved or the function implemented by the program code, for example, a crawler function, an API called by a mailbox, and so on. These clustering attributes reflect the role the historical clustering data played in solving historical technical problems; the decomposed clustering atoms can inherit or be associated with these clustering attributes and be reused to solve subsequent technical problems. The clustering attributes inherited by or associated with the clustering atoms can serve as a basis for selecting clustering atoms, thus avoiding the low efficiency of blind selection.
In step S302, a clustering atom pool that includes the unstructured relationships of the clustering atoms is formed according to the properties of the clustering atoms. In this embodiment, the clustering atoms are pooled so that they are organized efficiently, which in turn makes it convenient to call clustering atoms that are associated with one another. As shown in
Refer to
In some embodiments of the present application, the historical clustering data is historical corpus clustering data, and the clustering atoms are corpus clustering atoms. For example, the historical clustering data can be application data whose clustering atoms can be reorganized, such as agreement text; the clustering atoms are then the chapters of the agreement text (also known as “Paragraphs”), and these chapters can be assembled into other agreement texts. A chapter has the same “Properties” in the original agreement text as it does in the assembled agreement text (such as the “TERMS” section, “SUBJECT MATTER” section, and “LIABILITY” section).
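The storage method of steps S301 and S302 ends with a pool in which atoms are stored together with their relationships. The sketch below uses an in-memory adjacency structure as a stand-in for a graph database; it does not reflect any particular graph database API. The class `AtomPool`, its methods, and the rule that atoms sharing a property are linked are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict


class AtomPool:
    """In-memory stand-in for a graph-organized clustering atom pool."""

    def __init__(self):
        self.nodes = {}                # atom_id -> {"prop": ..., "attrs": ...}
        self.edges = defaultdict(set)  # atom_id -> ids of related atoms

    def add(self, atom_id, prop, attrs):
        """Store an atom and link it to stored atoms with the same property.

        Linking by equal property is an assumed rule; a real pool could
        record any relationship among atoms.
        """
        self.nodes[atom_id] = {"prop": prop, "attrs": dict(attrs)}
        for other_id, other in self.nodes.items():
            if other_id != atom_id and other["prop"] == prop:
                self.edges[atom_id].add(other_id)
                self.edges[other_id].add(atom_id)

    def related(self, atom_id):
        """Return the ids of atoms linked to `atom_id`, sorted for stability."""
        return sorted(self.edges.get(atom_id, ()))


# Hypothetical atoms decomposed from two historical contracts.
pool = AtomPool()
pool.add("A:LIABILITY", "LIABILITY", {"region": "CN"})
pool.add("B:LIABILITY", "LIABILITY", {"region": "US"})
pool.add("A:TERMS", "TERMS", {"region": "CN"})
print(pool.related("A:LIABILITY"))
```

With the links stored alongside the atoms, a later search can move from one matching atom to its neighbors instead of scanning the whole pool.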
According to another aspect of this application, a data clustering system is provided. As shown in
The analyzing unit 401 can associate the clustering atoms with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong. Because the clustering atoms are decomposed from the historical clustering data to which they belong, they inherit or are related to at least part of the attributes of that data. Assigning attributes to the clustering atoms makes them convenient to associate and reorganize.
As shown in
In some embodiments of this application, for general semantic text, the clustering attributes may include language type, literary style, and so on. For general contracts, the clustering attributes can include object (subject matter), category, region, sex, age, (effective) period, and so on. For original products that are used to construct contracts such as insurance financing contracts, the clustering attributes can also include type of insurance, time of sale, and so on. For program code, the clustering attribute can be the problem solved or the function implemented by the program code, for example, a crawler function, an API called by a mailbox, and so on. These clustering attributes reflect the role the historical clustering data played in solving historical technical problems; the decomposed clustering atoms can inherit or be associated with these clustering attributes and be reused to solve subsequent technical problems. The clustering attributes inherited by or associated with the clustering atoms can serve as a basis for selecting clustering atoms, thus avoiding the low efficiency of blind selection.
A pooling unit 402 is configured to form a clustering atom pool according to the properties of the clustering atoms, the pool comprising an unstructured relationship of the clustering atoms. In this embodiment, the clustering atoms are pooled so that they are organized efficiently, which in turn makes it convenient to call clustering atoms that are associated with one another. As shown in
Refer to
A search unit 403 is configured to search the pooling unit for clustering atoms to form alternative clustering atoms, wherein the search is based on a target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms. For example, to construct target clustering data 105 as shown in
An assembly unit 404 is configured to form the target clustering data by referencing the alternative clustering atoms. The search by the search unit 403 may return many alternative choices. The target clustering data 105 can then be constructed by selecting, according to the requirements, appropriate options from these alternative clustering atoms. As shown in
In some embodiments of the present application, the historical clustering data is historical corpus clustering data, and the clustering atoms are corpus clustering atoms. For example, the historical clustering data can be application data whose clustering atoms can be reorganized, such as agreement text; the clustering atoms are then the chapters of the agreement text (also known as “Paragraphs”), and these chapters can be assembled into other agreement texts. A chapter has the same “Properties” in the original agreement text as it does in the assembled agreement text (such as the “TERMS” section, “SUBJECT MATTER” section, and “LIABILITY” section).
In some embodiments of this application, the search is also based on corpus matching. As described above, the search is based on the target clustering attribute of the target clustering data, the clustering attributes associated with the clustering atoms, and the properties of the clustering atoms. In other embodiments, corpus matching can further restrict the search results so that the alternative clustering atoms meet the search requirements more closely in semantic terms. Corpus matching can include keyword matching, synonym matching, and so on.
In some embodiments of this application, there are hierarchical relationships among the clustering atoms, wherein: when a superior clustering atom is taken as an alternative clustering atom, its inferior clustering atoms are also taken as alternative clustering atoms; and a superior clustering atom can be traced upward from an inferior clustering atom that is an alternative clustering atom, and that superior clustering atom is set as an alternative clustering atom. Further refer to
In some embodiments of this application, if the referenced alternative clustering atoms are not compatible with each other, a hint message is generated. In some embodiments, two or more alternative clustering atoms should not be referenced at the same time, and a hint message can be generated if there is a reference conflict. For example, if the clustering atom 1012 and the clustering atom 1022 have the same properties and both meet the search conditions, both will be selected as alternative clustering atoms. Because the target clustering data 105 needs only one paragraph that meets the specific properties, the clustering atom 1012 and the clustering atom 1022 cannot be referenced at the same time. In some embodiments, if the user initiates a reference to the clustering atom 1012 and the clustering atom 1022 at the same time, the system can alert the user to the conflict in the reference by returning a hint message. The above shows only one specific case of “incompatibility” and does not limit the scope of protection of the invention.
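The hierarchical relationships among clustering atoms described above for the search can also be sketched in code. The function `expand_with_hierarchy`, the `parent_of` mapping, and the sample identifiers are hypothetical, and the sketch assumes the hierarchy is a tree (each atom has at most one superior and there are no cycles). It adds the inferior atoms of every candidate and traces upward to the superiors of every candidate; siblings of a candidate are not added.

```python
def expand_with_hierarchy(candidates, parent_of):
    """Expand alternative atoms along an assumed tree-shaped hierarchy.

    `parent_of` maps an inferior atom id to its superior atom id.
    Downward: every inferior atom below a candidate is added.
    Upward: every superior atom above a candidate is added.
    """
    children_of = {}
    for child, parent in parent_of.items():
        children_of.setdefault(parent, []).append(child)

    result = set(candidates)

    # Downward pass: collect all descendants of the original candidates.
    stack = list(candidates)
    while stack:
        node = stack.pop()
        for child in children_of.get(node, []):
            if child not in result:
                result.add(child)
                stack.append(child)

    # Upward pass: trace each original candidate to the top of the tree.
    for node in candidates:
        while node in parent_of:
            node = parent_of[node]
            result.add(node)

    return result


# Hypothetical hierarchy: a contract contains a chapter with two clauses.
parent_of = {
    "clause-1.1": "chapter-1",
    "clause-1.2": "chapter-1",
    "chapter-1": "contract",
}
from_superior = expand_with_hierarchy({"chapter-1"}, parent_of)
from_inferior = expand_with_hierarchy({"clause-1.1"}, parent_of)
print(sorted(from_superior), sorted(from_inferior))
```

Starting from the chapter pulls in both clauses and the enclosing contract, while starting from a single clause pulls in only its chain of superiors.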
According to another aspect of this application, a data storage system is provided. As shown in
The clustering atoms decomposed from the historical clustering data are associated with at least one of the clustering attributes of the historical clustering data to which the clustering atoms belong. Because the clustering atoms are decomposed from the historical clustering data to which they belong, they inherit or are related to at least part of the attributes of that data. Assigning attributes to the clustering atoms makes them convenient to associate and reorganize.
As shown in
In some embodiments of this application, for general semantic text, the clustering attributes may include language type, literary style, and so on. For general contracts, the clustering attributes can include object (subject matter), category, region, sex, age, (effective) period, and so on. For original products that are used to construct contracts such as insurance financing contracts, the clustering attributes can also include type of insurance, time of sale, and so on. For program code, the clustering attribute can be the problem solved or the function implemented by the program code, for example, a crawler function, an API called by a mailbox, and so on. These clustering attributes reflect the role the historical clustering data played in solving historical technical problems; the decomposed clustering atoms can inherit or be associated with these clustering attributes and be reused to solve subsequent technical problems. The clustering attributes inherited by or associated with the clustering atoms can serve as a basis for selecting clustering atoms, thus avoiding the low efficiency of blind selection.
The storage unit 502 is configured to form a clustering atom pool according to the properties of the clustering atoms, the pool comprising an unstructured relationship of the clustering atoms. In this embodiment, the clustering atoms are pooled so that they are organized efficiently, which in turn makes it convenient to call clustering atoms that are associated with one another. As shown in
Refer to
In some embodiments of the present application, the historical clustering data is historical corpus clustering data, and the clustering atoms are corpus clustering atoms. For example, the historical clustering data can be application data whose clustering atoms can be reorganized, such as agreement text; the clustering atoms are then the chapters of the agreement text (also known as “Paragraphs”), and these chapters can be assembled into other agreement texts. A chapter has the same “Properties” in the original agreement text as it does in the assembled agreement text (such as the “TERMS” section, “SUBJECT MATTER” section, and “LIABILITY” section).
According to another aspect of the application, a computer-readable storage medium is provided, in which instructions are stored such that a processor performs any of the methods described above when the instructions are executed by the processor. The computer-readable media referred to in this application include various types of computer storage media, which may be any available media accessible by a general-purpose or special-purpose computer. For example, a computer-readable medium may include RAM, ROM, EPROM, EEPROM, registers, hard disks, removable disks, CD-ROM or other optical disc storage devices, disk storage devices or other magnetic storage devices, or any other transitory or non-transitory medium capable of carrying or storing desired program code units in the form of instructions or data structures and capable of being accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. As used herein, a disk usually reproduces data magnetically, while a disc reproduces data optically with a laser. Combinations of the above shall also be included in the scope of protection of the computer-readable medium. An exemplary storage medium is coupled to the processor so that the processor can read information from and write information to the storage medium. Alternatively, the storage medium can be integrated into the processor. The processor and the storage medium can reside in an ASIC. The ASIC can reside in a user terminal. Alternatively, the processor and the storage medium can reside in the user terminal as discrete components.
The above is only a specific implementation of this application, but the scope of protection of this application is not limited to it. A person skilled in the art may think of other feasible changes or substitutions in light of the technical scope disclosed in this application, all of which are covered by this application. In the absence of conflict, the embodiments of this application and the features of the embodiments may also be combined with each other. The scope of protection of this application shall be defined by the claims.
Number | Date | Country | Kind |
---|---|---|---|
202011292917.5 | Nov 2020 | CN | national |
This application is a National Stage of International Application No. PCT/CN2021/128330, filed Nov. 3, 2021, which in turn claims the benefit of Chinese Patent Application 202011292917.5, filed Nov. 18, 2020. The entire disclosures of the above applications are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2021/128330 | 11/3/2021 | WO | |