The present disclosure belongs to the field of electronic information, and is particularly related to a method and device for constructing a legal knowledge graph based on joint entity and relation extraction.
A knowledge graph, which is called knowledge domain visualization or knowledge domain choropleth map in the library and information industry, includes a series of different graphs showing the knowledge development process and structural relation. Knowledge resources and their carriers are described using visualization technology, and knowledge and their interrelations are mined, analyzed, constructed, drawn and displayed. A knowledge graph is a modern theory that combines the theories and methods of applied mathematics, graphics, information visualization technology, information science and other disciplines with bibliometric citation analysis, co-occurrence analysis and other methods, and visually displays the core structure, development history, frontier fields and overall knowledge structure of a discipline by means of visualized graphs so as to realize multi-disciplinary integration, which can provide practical and valuable reference for discipline research.
Knowledge graphs are divided into general knowledge graphs and domain knowledge graphs. The two knowledge graphs are mainly different in coverage and usage. A general knowledge graph is oriented to the general field, which mainly contains a lot of common knowledge in the real world and covers a wide range. A domain knowledge graph, also known as industry knowledge graph or vertical knowledge graph, is oriented to a specific field, and it is an industry knowledge base composed of professional data in this field. Because it is based on industry data and has strict and rich data patterns, it has higher requirements for the depth and accuracy of knowledge in this field. A domain knowledge graph is a knowledge graph for a specific field, such as e-commerce, finance, medical care, etc. Comparatively speaking, the domain knowledge graph has more knowledge sources, faster scale expansion, a more complex knowledge structure, higher requirements for knowledge quality and more extensive application forms.
Knowledge graphs oriented to the legal field are still rarely constructed or studied. With the rapid and comprehensive development of law, the demand for knowledge graphs in the legal field is gradually emerging. The construction of a domain knowledge graph needs a large amount of information in the field. How to extract valuable information from massive unstructured or semi-structured information has attracted the attention of many scholars, and information extraction technology has emerged in response. The construction of a knowledge graph mainly uses two subtasks of information extraction: entity extraction and relation extraction. Entity extraction, also known as named entity recognition, refers to the identification of entities with specific meanings in texts, mainly including names of people, places and institutions, proper nouns, and expressions of time, quantity, currency and proportional value. The main task of relation extraction is to extract the relations between entities in the text. Generally, the relation between entities is formally described as a triple <h, r, t>, where h and t represent a head entity and a tail entity, and r represents the relation between them. For example, in the sentence “Farewell My Concubine is directed by Chen Kaige”, “Farewell My Concubine” and “Chen Kaige” are both entities, and the relation between the two entities is the “directed by” relation, which can be expressed as the triple <Farewell My Concubine, directed by, Chen Kaige>. The main purpose of information extraction is to extract data in the form of triples from a large number of unstructured or semi-structured texts. However, traditional information extraction technology adopts a pipeline method: for unstructured texts, entity extraction is carried out first, and relation extraction is then carried out on the basis of the entity extraction result. This has a major drawback: once the result of entity extraction is wrong, the accuracy of relation extraction is greatly affected, which leads to error propagation.
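As an illustration of the triple form <h, r, t> described above, the following minimal Python sketch represents a triple as a small record. The names `Triple` and `format_triple` are hypothetical and are not part of the disclosure.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A relation triple <h, r, t> (hypothetical helper type)."""
    head: str      # head entity h
    relation: str  # relation r between the two entities
    tail: str      # tail entity t


def format_triple(triple: Triple) -> str:
    """Render a triple in the <h, r, t> notation used in the text."""
    return f"<{triple.head}, {triple.relation}, {triple.tail}>"


# The example sentence from the background section.
example = Triple("Farewell My Concubine", "directed by", "Chen Kaige")
print(format_triple(example))
```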
In view of the problem that the accuracy of knowledge graph construction in the legal field cannot be guaranteed in the prior art, the present disclosure provides a method and device for constructing a legal knowledge graph based on joint entity and relation extraction, so as to obtain a knowledge graph with higher accuracy.
In order to achieve the above purpose, the present disclosure has the following technical solution:
A method for constructing a legal knowledge graph based on joint entity and relation extraction, including the following steps:
Constructing a triple data set.
Splitting a sentence of a legal text into short sentences.
Completing default subjects in the short sentences.
Extracting triples from the short sentences and constructing the triple data set.
Designing a model architecture and training a model.
The model architecture comprises an encoding layer, a head entity extraction layer and a relation-tail entity extraction layer.
In an embodiment, the encoding layer uses a Bert pre-training model.
In the head entity extraction layer, two BiLSTMs are used as binary classifiers, and codes of a text are used as an input of the classifiers; in output information, an entity start position corresponding to the first BiLSTM binary classifier is output as 1, with outputs of the other positions being all 0, and an entity end position corresponding to the second BiLSTM binary classifier is output as 1, with outputs of the other positions being all 0.
The relation-tail entity extraction layer combines coding information of head entities with the codes of the sentence as an input, and for each head entity, finds tail entities that may exist under each relation, and finally obtains a complete triple.
Determining the relation between the sentences of the text.
For a sentence of the legal text that has not been split into short sentences, determining the relation between the short sentences.
Triple combination and graph visualization.
Obtaining a combined triple corresponding to the legal text according to the triple extracted by the model and the relation between sentences of the text.
Visualizing the legal knowledge graph.
The present disclosure further provides a legal knowledge graph construction system based on joint entity and relation extraction, including:
A triple data set construction module configured to split a sentence of a legal text into short sentences, complete default subjects in the short sentences, and finally extract triples from the short sentences to construct a triple data set.
A model building and training module configured to respectively construct an encoding layer, a head entity extraction layer and a relation-tail entity extraction layer in a model architecture and obtain a model capable of extracting triples by training.
An inter-sentence relation determining module configured to determine the relation between short sentences for a sentence of the legal text which has not been split into short sentences.
A knowledge graph visualization module configured to obtain a combined triple corresponding to the legal text according to the triples extracted by the model and the relation between sentences of the text, and realize visualization of the legal knowledge graph.
The present disclosure further provides a terminal device, comprising a memory, a processor and a computer program stored in the memory and running on the processor, wherein when the processor executes the computer program, the steps of the method for constructing a legal knowledge graph based on joint entity and relation extraction are implemented.
The present disclosure further provides a computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for constructing a legal knowledge graph based on joint entity and relation extraction.
Compared with the prior art, the present disclosure has the following beneficial effects:
Existing knowledge graph construction methods often adopt a pipeline approach, which first extracts entities and then extracts relations from the results of entity extraction, leading to the accumulation of errors. The present disclosure utilizes a model based on joint entity and relation extraction to perform triple extraction for unstructured texts in the field of contract law, and finally constructs a knowledge graph in the field of law. The present disclosure can avoid error propagation caused by the pipeline method, and has high accuracy. The model framework of the present disclosure adopts the Chinese Bert pre-training model as an encoder, which has a good effect on Chinese text encoding. As the entities in the field of law are in the form of phrases, the entity extraction part adopts two BiLSTM binary classifiers to identify the start position and end position of the entities, which can effectively extract the entities in the form of phrases from the text. According to the present disclosure, given a paragraph of a text, the head entity is extracted first, and the extracted head entity is then used to extract the tail entity corresponding to each entity relation. When extracting entity relations and tail entities, not only the encoding information of sentences is used, but also the encoding information of head entities is incorporated. A legal knowledge graph with high accuracy can be obtained by the method, and the constructed knowledge graph can realize the functions of question-and-answer reasoning and related recommendation in the field of contract law by combining with deep learning technology.
The present disclosure will be further explained in detail with reference to the drawings and embodiments below.
The present disclosure provides a method for constructing a legal knowledge graph based on joint entity and relation extraction. The embodiment is explained by taking the contract law as an example. The present disclosure can use a given contract law text to extract entities and relations at the same time, and finally obtains complete triple information. The extracted triples are connected end to end to form a knowledge graph of contract law. The completed knowledge graph can be combined with deep learning technology to realize the functions of question-and-answer reasoning and related recommendation in the field of contract law.
Entity extraction: for any complete contract law text statement, it can be decomposed into the form of (h, r, t), where h represents the head entity, r represents the entity relation, and t represents the tail entity. Entity extraction means extracting the head and tail entities in the text.
Relation extraction: the relation here refers to the relation between entities or the attributes of entities. This step usually involves extraction of the corresponding entity relation after the extraction of the head entity and the tail entity is completed.
Joint extraction: different from the previous entity extraction and relation extraction, which are carried out independently, the entities and relations extracted by joint extraction influence each other. Using joint extraction can reduce the error propagation problem caused by entity extraction.
Referring to
Step 1: Constructing a Triple Data Set of Contract Law, Including:
1.1) Splitting a Complex Contract Law Text Statement into Simple Short Sentences.
Most contract law texts are composed of two or more short sentences, and there is a certain logical relation between the short sentences. For example, Article 9 of the Contract Law reads: “the parties shall have corresponding capacity for civil rights and capacity for civil conduct when concluding a contract”. In order to extract the triples from the text accurately, it is necessary to split the statement into several short sentences. Two short sentences can be obtained from this example: “the parties conclude a contract” and “shall have corresponding capacity for civil rights and capacity for civil conduct”.
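The splitting step can be sketched as follows. This is a minimal illustration that assumes short sentences are delimited by commas or semicolons (ASCII or full-width); the disclosure does not state the actual splitting rule, and the input below is a literal, clause-by-clause rendering of the Article 9 example chosen only for illustration.

```python
import re
from typing import List

# Assumed clause delimiters: ASCII and full-width commas and semicolons.
_CLAUSE_DELIMITERS = re.compile(r"[,;，；]")


def split_into_short_sentences(statement: str) -> List[str]:
    """Split a statement into non-empty, whitespace-trimmed short sentences."""
    # Drop a trailing sentence terminator before splitting.
    body = statement.strip().rstrip(".。")
    parts = _CLAUSE_DELIMITERS.split(body)
    return [part.strip() for part in parts if part.strip()]


# Literal rendering of the example, with the clause boundary kept as a comma.
statement = (
    "the parties conclude a contract, shall have corresponding capacity "
    "for civil rights and capacity for civil conduct."
)
short_sentences = split_into_short_sentences(statement)
print(short_sentences)
```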
1.2) Using the Technology of Zero Anaphora Resolution to Solve the Problem of Subject Loss Caused by Short Sentence Splitting.
Subjects may be missing from some of the short sentences obtained in step 1.1), which will affect the subsequent triple extraction, so it is necessary to complete the default subjects. In this method, the open source tool pyltp is combined with a dependency syntactic analysis method to perform syntactic analysis on the default part and complete the default subjects. For the example in step 1.1), the short sentences after subject completion are “the parties conclude a contract” and “(the parties) shall have corresponding capacity for civil rights and capacity for civil conduct”.
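A minimal sketch of subject completion is given below. It does not call pyltp; instead it assumes each short sentence has already been parsed into (word, dependency relation) pairs, with the label "SBV" marking a subject-verb arc as in LTP-style dependency output. The hand-written parse and the carry-over heuristic (reuse the most recent subject) are assumptions made for illustration, not the method of the disclosure.

```python
from typing import List, Optional, Tuple

# One parsed token: (word, dependency relation to its head).
Token = Tuple[str, str]


def find_subject(clause: List[Token]) -> Optional[str]:
    """Return the first word attached by a subject-verb (SBV) arc, if any."""
    for word, relation in clause:
        if relation == "SBV":
            return word
    return None


def complete_subjects(clauses: List[List[Token]]) -> List[str]:
    """Prefix subject-less clauses with the most recent subject seen so far."""
    completed: List[str] = []
    last_subject: Optional[str] = None
    for clause in clauses:
        words = [word for word, _ in clause]
        subject = find_subject(clause)
        if subject is not None:
            last_subject = subject
        elif last_subject is not None:
            # Mark the restored subject with parentheses, as in the text.
            words = [f"({last_subject})"] + words
        completed.append(" ".join(words))
    return completed


# Hand-written parse of the two short sentences (not real pyltp output).
parsed = [
    [("the parties", "SBV"), ("conclude", "HED"), ("a contract", "VOB")],
    [("shall have", "HED"),
     ("corresponding capacity for civil rights and capacity for civil conduct", "VOB")],
]
completed = complete_subjects(parsed)
print(completed)
```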
1.3) Constructing a Triple Data Set of Contract Law.
For the short sentences after subject completion, the required triples can be extracted therefrom. In order to ensure the performance of the triple extraction model, it is necessary to manually label triple data to train the model. A complete sentence usually consists of three parts: the subject, the predicate and the object. Therefore, when marking data, the subject and the object of the sentence are taken as the head entity and the tail entity respectively, and the predicate is preliminarily marked as the relation. After labeling some triples, the relation set is determined according to the labeled relations, and the relations with the same or similar semantics are merged. For example, the relation “concept” and the relation “definition” are semantically similar, and the relation “definition” is unified into the relation “concept”.
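The relation merging described above can be sketched as a simple alias table. Only the "definition" to "concept" mapping comes from the text; the sample triples and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

LabeledTriple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Semantically similar relations are unified into one canonical label.
RELATION_ALIASES: Dict[str, str] = {"definition": "concept"}


def normalize_relation(relation: str) -> str:
    """Map a preliminary relation label to its canonical label."""
    return RELATION_ALIASES.get(relation, relation)


def build_relation_set(
    triples: List[LabeledTriple],
) -> Tuple[List[LabeledTriple], List[str]]:
    """Normalize the relations of labeled triples and collect the relation set."""
    normalized = [(h, normalize_relation(r), t) for h, r, t in triples]
    relation_set = sorted({r for _, r, _ in normalized})
    return normalized, relation_set


# Hypothetical labeled data, for illustration only.
labeled = [
    ("term A", "definition", "description of term A"),
    ("term B", "concept", "description of term B"),
    ("the parties", "conclude", "a contract"),
]
normalized, relation_set = build_relation_set(labeled)
print(relation_set)
```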
Step 2: Designing the Model Architecture and Training the Model, Including:
2.1) Designing a Model Architecture.
After the triple training data are labeled, the following experimental model is designed. The design of this model mainly considers the following aspects. Firstly, the Bert pre-training model uses a bidirectional Transformer. During pre-training, it uses the Masked Language Model (MLM) task to capture word-level representations, so that each word vector carries information from the context on both sides rather than only from the preceding text, and it uses the Next Sentence Prediction (NSP) task to capture sentence-level representations. Therefore, using a Bert pre-training model in the encoding layer can better represent the deep meaning of sentences. Secondly, entities in contract law texts differ from entities in the general field in that they include not only word entities but also phrase entities, which traditional NER methods cannot extract accurately. Therefore, two BiLSTMs are used as binary classifiers, and the text encoding is used as the input of the classifiers. In the output, the first BiLSTM classifier outputs 1 at the start position of an entity and 0 at all other positions, while the second BiLSTM classifier outputs 1 at the end position of an entity and 0 at all other positions. The start position and the end position of each entity are thus extracted separately, so that phrase entities can be extracted as needed. Thirdly, in order to make full use of the encoding information of the head entity when extracting the entity relation and the tail entity, the input of the relation and tail entity extraction part of the model is not only the encoding of the whole sentence but the combination of the encoding of the head entity with the encoding of the sentence, which benefits the extraction of the entity relation and the tail entity.
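The start and end outputs of the two binary classifiers can be illustrated by the 0/1 target sequences they would be trained against. The following sketch builds those targets from entity spans; the tokenization and the function name `spans_to_labels` are assumptions for illustration.

```python
from typing import List, Tuple


def spans_to_labels(
    num_tokens: int, spans: List[Tuple[int, int]]
) -> Tuple[List[int], List[int]]:
    """Build 0/1 start and end label sequences from inclusive entity spans."""
    start_labels = [0] * num_tokens
    end_labels = [0] * num_tokens
    for start, end in spans:
        if not 0 <= start <= end < num_tokens:
            raise ValueError(f"invalid span ({start}, {end})")
        start_labels[start] = 1  # target of the first classifier
        end_labels[end] = 1      # target of the second classifier
    return start_labels, end_labels


# A phrase entity "the parties" covering tokens 0..1 (illustrative tokens).
tokens = ["the", "parties", "conclude", "a", "contract"]
start_labels, end_labels = spans_to_labels(len(tokens), [(0, 1)])
print(start_labels, end_labels)
```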
By the above analysis, an algorithm model of joint entity and relation extraction in the field of contract law is designed, and the model frame diagram is shown in the figure. The model is divided into three parts, namely, the encoding layer, the head entity extraction layer, and the relation-tail entity extraction layer. Refer to
(a) Encoding Layer
The encoding layer of the present disclosure adopts BERT-wwm-ext, a Chinese pre-training model released by the Joint Laboratory of Harbin Institute of Technology and iFLYTEK Research, which is trained with whole word masking on a larger-scale corpus and further improves performance on a plurality of benchmark tests. The model can be used to transform the input text into the form of feature vectors.
(b) Head Entity Extraction Layer
This layer is mainly composed of two BiLSTMs with the same structure. The feature vector $x_i$ output by the Bert encoding layer is used as the input, and the start and end markers of the extracted entity are output:
$$p_i^{h\_start} = \mathrm{BiLSTM}(W_s x_i + b_s)$$
$$p_i^{h\_end} = \mathrm{BiLSTM}(W_e x_i + b_e)$$
where $x_i$ is the feature vector of each word, $W_s$ and $W_e$ are the trainable weight matrices of the two binary classifiers, and $b_s$ and $b_e$ are the respective bias vectors; $p_i^{h\_start}$ is a marker of the start position of an entity, and when its value approaches 1, the position is the start position of the entity; $p_i^{h\_end}$ is a marker of the end position of the entity, and when its value approaches 1, the position is the end position of the entity. For the example in
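The following sketch shows one way the start and end markers could be decoded into entity spans. The 0.5 threshold and the rule of pairing each start with the nearest end at or after it are assumptions; the text only states that values approaching 1 mark start and end positions.

```python
from typing import List, Tuple


def decode_spans(
    p_start: List[float], p_end: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, p in enumerate(p_start) if p > threshold]
    ends = [i for i, p in enumerate(p_end) if p > threshold]
    spans = []
    for start in starts:
        following_ends = [end for end in ends if end >= start]
        if following_ends:
            spans.append((start, following_ends[0]))
    return spans


# Made-up classifier outputs for a five-token sentence.
p_start = [0.9, 0.1, 0.2, 0.8, 0.1]
p_end = [0.1, 0.95, 0.1, 0.2, 0.7]
spans = decode_spans(p_start, p_end)
print(spans)
```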
(c) Relation-Tail Entity Extraction Layer
Similar to the head entity extraction layer, this layer is also composed of two BiLSTMs with the same structure. The input of this layer is not only the feature vector $h_s$ of the sentence, but also incorporates the head entity encoding $h_{head}^k$ extracted from the previous layer, where k denotes the k-th head entity. $h_s + h_{head}^k$ is taken as the input vector of this layer. The specific formulas are:
$$p_i^{t\_start} = \mathrm{BiLSTM}(W_{sr}(x_i + h_{head}^k) + b_{sr})$$
$$p_i^{t\_end} = \mathrm{BiLSTM}(W_{er}(x_i + h_{head}^k) + b_{er})$$
where $x_i$ is the feature vector of the i-th word; the vectors $h_s$ and $h_{head}^k$ have the same dimension and are combined by direct vector addition; for the k-th head entity, the average of the word vectors from its start position to its end position is taken as the representation $h_{head}^k$; $W_{sr}$ and $W_{er}$ are trainable parameter matrices for the start position and the end position, and $b_{sr}$ and $b_{er}$ are the corresponding bias vectors; for each head entity, all the relations in the relation set are traversed and the above calculation is repeated to find the tail entities that may exist under each relation, thereby finally obtaining complete triples. For the example in
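The fusion of the head entity encoding with the sentence encoding can be sketched with plain lists. The head entity vector is the average of the word vectors in its span, and it is added to every word vector of the sentence. The toy two-dimensional vectors and the function names are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def head_entity_vector(token_vectors: List[Vector], start: int, end: int) -> Vector:
    """Average the word vectors from start to end (inclusive)."""
    span = token_vectors[start:end + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


def fuse_with_head(token_vectors: List[Vector], start: int, end: int) -> List[Vector]:
    """Add the head entity vector to each word vector of the sentence."""
    head = head_entity_vector(token_vectors, start, end)
    return [[x + h for x, h in zip(vec, head)] for vec in token_vectors]


# Toy sentence encoding with three tokens; the head entity spans tokens 0..1.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
head = head_entity_vector(token_vectors, 0, 1)
fused = fuse_with_head(token_vectors, 0, 1)
print(head, fused)
```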
2.2) Using the Model
As shown in
To extract triples from contract law texts not included in the test set, the text is used as the input of the trained model, and the output of the model is the triples corresponding to the text.
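The overall extraction loop (head entities first, then tail entities under each relation) can be sketched as below. The two callables stand in for the trained head entity layer and relation-tail entity layer; here they are hypothetical lookup stubs, not a trained model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ExtractedTriple = Tuple[str, str, str]


def extract_triples(
    text: str,
    relations: Sequence[str],
    find_heads: Callable[[str], List[str]],
    find_tails: Callable[[str, str, str], List[str]],
) -> List[ExtractedTriple]:
    """For each head entity, traverse all relations and collect tail entities."""
    triples = []
    for head in find_heads(text):
        for relation in relations:
            for tail in find_tails(text, head, relation):
                triples.append((head, relation, tail))
    return triples


# Hypothetical stubs standing in for the two trained extraction layers.
def stub_find_heads(text: str) -> List[str]:
    return ["parties"] if "parties" in text else []


_TAIL_TABLE: Dict[Tuple[str, str], List[str]] = {
    ("parties", "concluding"): ["contract"],
}


def stub_find_tails(text: str, head: str, relation: str) -> List[str]:
    return _TAIL_TABLE.get((head, relation), [])


result = extract_triples(
    "the parties conclude a contract",
    ["concept", "concluding"],
    stub_find_heads,
    stub_find_tails,
)
print(result)
```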
Step 3: Determining the Relation Between Text Sentences.
For a contract law text that has not been split into short sentences, the open source tool pyltp combined with a rule matching method is used to determine the relation between its short sentences. There are four kinds of sentence relations: conditional, turning (adversative), juxtaposition (coordinate) and causal, among which there are 85 causal relations, 194 conditional relations, 34 turning relations and 8 juxtaposition relations. For example, for the contract law text “the parties shall have corresponding capacity for civil rights and capacity for civil conduct when concluding a contract”, the sentence relation extracted between the two short sentences is the “conditional relation”.
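The rule matching part can be sketched as keyword rules checked in a fixed order. The English cue words below are hypothetical stand-ins chosen for illustration; the actual rules of the disclosure, and the way pyltp is combined with them, are not given in the text.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical cue words per relation type; the first matching rule wins.
RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("causal", ("because", "therefore")),
    ("conditional", ("when", "if")),
    ("turning", ("but", "however")),
    ("juxtaposition", ("as well as", "at the same time")),
]


def classify_sentence_relation(sentence: str) -> Optional[str]:
    """Return the relation of the first rule whose cue occurs as a whole word."""
    lowered = sentence.lower()
    for relation, cues in RULES:
        for cue in cues:
            if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
                return relation
    return None


text = (
    "the parties shall have corresponding capacity for civil rights and "
    "capacity for civil conduct when concluding a contract"
)
relation = classify_sentence_relation(text)
print(relation)
```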
Step 4: Triple Combination and Graph Visualization, Including:
4.1) Integrating Triples.
A combined triple corresponding to the contract law text can be obtained from the triples extracted by the model based on the inter-sentence relations obtained in step 3. For the example “the parties shall have corresponding capacity for civil rights and capacity for civil conduct when concluding a contract”, the final triple form can be ((parties, concluding, contract), conditions, (parties shall have corresponding capacity for civil rights and capacity for civil conduct)). By integrating and splicing all the extracted triples, a complete knowledge graph of contract law can be obtained.
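A combined triple can be represented as a nested tuple whose head and tail are themselves clause-level triples. The sketch below links consecutive clause triples with the inter-sentence relation. Splitting the second clause into (parties, shall have, ...) is an assumption made so that both sides have the same shape; the text shows that clause unsplit.

```python
from typing import List, Tuple

ClauseTriple = Tuple[str, str, str]
CombinedTriple = Tuple[ClauseTriple, str, ClauseTriple]


def combine_triples(
    clause_triples: List[ClauseTriple], sentence_relation: str
) -> List[CombinedTriple]:
    """Link each pair of consecutive clause triples with the sentence relation."""
    return [
        (first, sentence_relation, second)
        for first, second in zip(clause_triples, clause_triples[1:])
    ]


clause_triples = [
    ("parties", "concluding", "contract"),
    # Assumed decomposition of the second short sentence.
    ("parties", "shall have",
     "corresponding capacity for civil rights and capacity for civil conduct"),
]
combined = combine_triples(clause_triples, "conditions")
print(combined)
```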
4.2) Visualizing the Knowledge Graph of Contract Law.
See
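The disclosure does not name a visualization tool. As one possible sketch, extracted triples can be written out as a directed graph in the Graphviz DOT text format, with entities as nodes and relations as edge labels. The function name and graph name are hypothetical.

```python
from typing import Iterable, Tuple


def _quote(value: str) -> str:
    """Quote a string as a DOT identifier, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def triples_to_dot(
    triples: Iterable[Tuple[str, str, str]], graph_name: str = "contract_law"
) -> str:
    """Render (head, relation, tail) triples as a DOT digraph."""
    lines = [f"digraph {graph_name} {{"]
    for head, relation, tail in triples:
        lines.append(f"    {_quote(head)} -> {_quote(tail)} [label={_quote(relation)}];")
    lines.append("}")
    return "\n".join(lines)


dot_text = triples_to_dot([("parties", "concluding", "contract")])
print(dot_text)
```

The resulting text can be rendered by any tool that reads DOT files.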
A legal knowledge graph construction system based on joint entity and relation extraction includes:
A triple data set construction module configured to split a sentence of a legal text into short sentences, complete default subjects in the short sentences, and finally extract triples from the short sentences to construct a triple data set.
A model building and training module configured to respectively construct an encoding layer, a head entity extraction layer and a relation-tail entity extraction layer in a model architecture and obtain a model capable of extracting triples by training.
An inter-sentence relation determining module configured to determine the relation between short sentences for a sentence of the legal text which has not been split into short sentences.
A knowledge graph visualization module configured to obtain a combined triple corresponding to the legal text according to the triples extracted by the model and the relation between sentences of the text, and realize visualization of the legal knowledge graph.
A terminal device includes a memory, a processor and a computer program stored in the memory and running on the processor, wherein when the processor executes the computer program, the steps of the method for constructing a legal knowledge graph based on joint entity and relation extraction are implemented.
A computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for constructing a legal knowledge graph based on joint entity and relation extraction.
The computer program can be divided into one or more modules/units, which are stored in the memory and executed by the processor to implement the method for constructing a knowledge graph according to the present disclosure.
The processor can be a Central Processing Unit (CPU), or can be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The memory can be used to store computer programs and/or modules, and the processor can realize various functions of the knowledge graph construction system of the present disclosure by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory.
The steps of the method or algorithm described combined with the embodiments of the present disclosure may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions. The software instructions may consist of corresponding software modules, and the software modules can be stored in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), registers, hard disks, removable hard disks, CD-ROMs or any other forms of storage media well-known in the art. An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium. The storage medium can also be an integral part of the processor. The processor and storage medium may reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the ASIC may be located in a node device, such as the processing node described above. In addition, the processor and storage medium may also exist in the node device as discrete components.
It should be noted that when the knowledge graph construction system provided in the foregoing embodiment constructs a knowledge graph, division into the foregoing functional modules is used only as an example for description. In an actual application, the foregoing functions can be allocated to and implemented by different functional modules based on a requirement, that is, an inner structure of the system is divided into different functional modules, to implement all or some of the functions described above. For details about a specific implementation process, refer to the method embodiment. Details are not described herein again.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, all or some of the procedures or functions according to the embodiments of the present disclosure are generated. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a server or a terminal, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disk (DVD)), or a semiconductor medium (for example, a solid-state drive).
The above description is only a preferred embodiment of the present disclosure, and it is not intended to limit the technical solution of the present disclosure. It should be understood by those skilled in the art that the technical solution can be modified and replaced in a number of simple ways without departing from the spirit and principle of the present disclosure, and all these modifications and substitutions shall fall within the scope of protection covered by the claims.
Number | Date | Country | Kind |
---|---|---|---|
202110513432.2 | May 2021 | CN | national |
The present application is a continuation of International Application No. PCT/CN2021/116053, filed on Sep. 1, 2021, which claims priority to Chinese Application No. 202110513432.2, filed on May 11, 2021, the contents of both of which are incorporated herein by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2021/116053 | Sep 2021 | US |
Child | 17956864 | | US |