The following invention aims to introduce a detection method for factual events in digital newspapers. In detail, the method has practical application in many automatic systems that collect, extract, and analyze text data to obtain events and track their related targets, thus providing early warnings about targets' activities to assist humans in making proper decisions and handling incoming incidents in a timely manner.
In the age of information technology, texts from online news platforms are an abundant source of data that needs to be explored. This calls for a system that automatically collects, extracts, and analyzes such information. The data obtained from this system can then be used for subsequent tasks such as tracking history or providing early warning of targets' actions, assisting organizations in making appropriate and timely decisions.
Nowadays, there are plenty of information extraction systems that try to obtain events from text, but they do not place an emphasis on extracting the events that actually occur. Specifically, such systems provide structured data about triggers (words that represent events) and arguments (text spans of the participants, location, and time of events). However, these events may be merely hypothetical, non-specific mentions, or negated as not having happened. To overcome this problem, the invention proposes a solution to detect events mentioned as occurring in fact. As a result, the method helps filter out unwanted events, lowering error propagation to downstream tasks and reducing overall system inference time.
The purpose of the proposed invention is to detect events that actually occur by employing some deep learning approaches. The detection method is performed through the following steps:
The proposed solution focuses on extracting triggers and classifying the realis status of each trigger word into three categories: “ACTUAL”, “GENERIC” and “OTHER”, where “ACTUAL” refers to specific events that are explicitly indicated to have occurred in the past, present, or future; “GENERIC” refers to events that are mentioned without any participants, places, or times; “OTHER” describes the remaining events such as those stated as not happening, believed events, and hypothetical events. The method's output includes the trigger words, their sentence spans, event types, and realis statuses. Since the target is to detect actual events, only triggers labeled “ACTUAL” are processed in the next steps, such as event argument extraction.
Refer to the accompanying drawings.
Current deep learning algorithms compute based on numbers, but raw text contains only words. Therefore, this information must be converted into numerical form before it can be input into computational models. First, the text is split into sentences, because current encoding models mostly operate at the sentence level. Each sentence is then split into words to support the POS tagging and digitization tasks. POS tags are employed because triggers are predefined as nouns, verbs, or adjectives, so these tags can provide useful features to the detection model. In short, this step processes a raw text into two numerical lists, one containing word identities and the other containing POS tag codes.
Regarding sentence splitting, an unsupervised learning method is employed to build a model for the identification of acronyms, common phrases, and words that start a sentence. This model is trained on a large corpus of text in a specific language before being used. For instance, the document “Mr. Biden will join 30 NATO leaders at the special, hastily organized meeting, then will go to a previously scheduled European Council summit. twenty-one of the European Union's 27 members belong to NATO, and it is possible that close NATO allies like Sweden and Finland will also attend the meeting.” will be separated into two sentences: “Mr. Biden will join 30 NATO leaders at the special, hastily organized meeting, then will go to a previously scheduled European Council summit.” and “twenty-one of the European Union's 27 members belong to NATO, and it is possible that close NATO allies like Sweden and Finland will also attend the meeting.” This model knows that the period in “Mr.” does not signal a new sentence and that the beginning of a sentence is not always capitalized, as in “twenty-one.”
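For illustration, the following is a minimal sketch of such an unsupervised splitter using NLTK's Punkt algorithm; the invention does not name a specific library, and the corpus file path is an assumption.

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    corpus_text = open("corpus.txt").read()  # assumed large training corpus
    trainer = PunktTrainer()
    trainer.train(corpus_text, finalize=True)  # learns abbreviations such as "Mr."
    splitter = PunktSentenceTokenizer(trainer.get_params())

    document = ("Mr. Biden will join 30 NATO leaders at the special, hastily "
                "organized meeting, then will go to a previously scheduled "
                "European Council summit. twenty-one of the European Union's 27 "
                "members belong to NATO, and it is possible that close NATO "
                "allies like Sweden and Finland will also attend the meeting.")
    for sentence in splitter.tokenize(document):
        print(sentence)  # ideally yields the two sentences shown above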
Each split sentence is further divided into words. The splitting method relies on regular expressions and the use of whitespace. For example, the sentence “Mr. Biden will attend NATO's summit.” will be split into a list of words [“Mr.”, “Biden”, “will”, “attend”, “NATO”, “'s”, “summit”, “.”].
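A possible word splitter based on regular expressions is sketched below; the pattern itself is an illustrative assumption, since the invention only states that regular expressions and whitespace are used.

    import re

    # Illustrative pattern: known abbreviations, clitics such as "'s", runs of
    # word characters, and single punctuation marks (pattern is an assumption).
    TOKEN_RE = re.compile(r"Mr\.|Mrs\.|Dr\.|'s|n't|\w+|[^\w\s]")

    print(TOKEN_RE.findall("Mr. Biden will attend NATO's summit."))
    # ['Mr.', 'Biden', 'will', 'attend', 'NATO', "'s", 'summit', '.']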
Once the word lists for each sentence have been collected, they are converted into numerical data. This stage requires a predefined vocabulary of frequent sub-words found in the corpus. Starting from the beginning of the current word, the method searches the vocabulary for the longest matching sub-word and splits the word into two units: that sub-word and the remainder. The remainder is handled in the same manner until all units have been digitized. For example, the longest vocabulary sub-word of the word “hugs” is “hug”, so the word is divided into “hug” and “##s”. Since the sub-word “##s” is already in the vocabulary, the final result is [8363; 1116], the codes of the two sub-words [“hug”, “##s”]. If a word contains no sub-word found in the vocabulary, its digitization yields the special token “[UNK]” with code 100.
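This longest-match-first sub-word scheme corresponds to WordPiece tokenization; a sketch using the HuggingFace transformers library follows, where the checkpoint name is an assumption and the exact sub-word codes depend on the chosen vocabulary.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
    pieces = tokenizer.tokenize("hugs")              # e.g. ["hug", "##s"]
    codes = tokenizer.convert_tokens_to_ids(pieces)  # e.g. [8363, 1116]
    print(pieces, codes)
    print(tokenizer.unk_token, tokenizer.unk_token_id)  # "[UNK]" and its code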
In addition to sentence separation, word separation, and word digitization, this stage also performs POS tagging of each word and converts the tags into numerical form. POS tags are automatically recognized using a pre-trained language model. Then, a lookup table is used to get the numeric value corresponding to each tag. Considering a split sentence [“Mr.”, “Biden”, “will”, “attend”, “NATO”, “'s”, “summit”, “.”], the identification results are [NNP, NNP, MD, VB, NNP, NNP, NNP, PUNCT] and their corresponding codes are [1, 1, 2, 3, 1, 1, 1, 10], where NNP is a singular proper noun, MD is a modal verb, VB is a verb in base form, and PUNCT is punctuation.
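The sketch below illustrates this tagging and lookup with NLTK's pre-trained tagger; the tag-to-code table is an assumed example, and the tags actually produced may differ from those listed above depending on the tagger.

    import nltk  # requires the averaged_perceptron_tagger data package

    # Assumed lookup table from POS tags to numeric codes.
    TAG_CODES = {"NNP": 1, "MD": 2, "VB": 3, "PUNCT": 10, ".": 10}

    words = ["Mr.", "Biden", "will", "attend", "NATO", "'s", "summit", "."]
    tags = [tag for _, tag in nltk.pos_tag(words)]   # pre-trained tagger
    codes = [TAG_CODES.get(tag, 0) for tag in tags]  # 0 for tags outside the table
    print(tags, codes)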
The meaning of the word in the context of the sentence is an important feature for conducting event classification. Furthermore, the POS tag is also a sign of the trigger word. To utilize these kinds of information, features must be extracted as vectors with a predefined number of dimensions. This process takes the output from the previous step as its input, and the outcome is a list of encoded vectors corresponding to each word in the sentence.
To begin, the numbers representing the words in a sentence are fed into the encoder to find the semantic vector of each word. Let S={w1, w2, w3, . . . , wn} denote a sentence S containing n digitized words from w1 to wn; after passing it through the encoder En, we obtain a set of contextual vectors V=En(S)={v1, v2, v3, . . . , vn}. Each vector vi contains information about the meaning of the word wi as well as the context of the entire sentence. For instance, the two sentences “The sink is in the kitchen.” and “The plane sinks into the river.” both contain the sub-word “sink”, but its meanings differ, so the two semantic vectors are far apart. The encoder En is principally a pre-trained deep learning language model, but in this invention its parameters are updated during training to ensure the best quality for the process.
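A minimal sketch of this contextual encoding with a pre-trained transformer is given below; the checkpoint name is an assumption, as the invention does not prescribe a particular encoder.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
    encoder = AutoModel.from_pretrained("bert-base-cased")        # parameters stay trainable

    batch = tokenizer("Mr. Biden will attend NATO's summit.", return_tensors="pt")
    v = encoder(**batch).last_hidden_state  # contextual vectors, shape (1, n, k)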
Each POS tag code ti is then fed into an embedding layer Em to extract the relevant features, producing an embedded vector pi=Em(ti). This embedding layer is also a deep learning model, acting as a mapping table from the single-dimensional space R1 to the m-dimensional space Rm. The value of m is a training hyperparameter that can be tuned based on evaluation on the cross-validation dataset.
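The POS embedding layer can be realized as a standard trainable lookup, as sketched below; the table size and the dimension m are assumed values.

    import torch
    import torch.nn as nn

    # Assumed values: 20 distinct tag codes, embedding dimension m = 16.
    pos_embedding = nn.Embedding(num_embeddings=20, embedding_dim=16)

    tag_codes = torch.tensor([1, 1, 2, 3, 1, 1, 1, 10])
    p = pos_embedding(tag_codes)  # embedded vectors p_i, shape (8, 16)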
Finally, each pair of vectors vi and pi is combined to form a single encoded vector Vi=[vi; pi]T. Supposing vi is a k-dimensional vector and pi is an m-dimensional vector, the dimension of the obtained vector Vi is (k+m).
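Concretely, this concatenation can be sketched as follows, with assumed dimensions k = 768 and m = 16.

    import torch

    k, m, n = 768, 16, 8           # assumed dimensions and sentence length
    v = torch.randn(n, k)          # semantic vectors v_i from the encoder
    p = torch.randn(n, m)          # POS embeddings p_i
    V = torch.cat([v, p], dim=-1)  # encoded vectors V_i, shape (n, k + m)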
This step locates the triggers in a sentence and classifies their event types by means of a neural network, whose architecture is shown in the accompanying drawing.
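Since the drawing is not reproduced here, the following is only a plausible sketch of such a trigger-detection network: a small feed-forward classifier applied to each encoded vector, where the hidden size h1 and the single hidden layer are assumptions.

    import torch.nn as nn

    class TriggerClassifier(nn.Module):
        # Assumed architecture: one hidden layer of h1 neurons per word vector.
        def __init__(self, in_dim, h1, num_event_types):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, h1),
                nn.ReLU(),
                nn.Linear(h1, num_event_types + 1),  # extra class: "not a trigger"
            )

        def forward(self, V):   # V: (n_words, k + m)
            return self.net(V)  # per-word event-type logits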
The loss function used during training is a cross-entropy function in the form of formula (3), where yi and ŷi are the annotated and predicted event types of the word wi, respectively:

L1=−Σi yi·log ŷi (3)
The aggregate feature vectors are prepared for realis classification in this step. Using the list of encoded vectors from step 2 and the trigger word positions from step 3, this process creates a vector for each trigger by joining the vectors around it. Specifically, the number of neighboring vectors chosen on the left is l and on the right is r; l is usually equal to r, and these values are adjustable during training. For the trigger at index j in the sentence, the result Vout is computed as in formula (4):
Vout=[Vj−l; Vj−l+1; . . . ; Vj; . . . ; Vj+r−1; Vj+r]T (4)
where Vj is the j-th word's encoded vector obtained from step 2 and T is the matrix transposition symbol.
In some special cases, the trigger word may be at the beginning or end of a sentence. As a result, the index of neighboring words may be outside the sentence's index limit. Meanwhile, the size of Vout must be a fixed number and equal to (r+l+1)×(k+m), so the zero vector Z=[0; 0; . . . ; 0]T is used to compensate for the missing vector positions.
Considering the sentence [“Mr.”, “Biden”, “will”, “attend”, “NATO”, “'s”, “summit”, “.”] whose output after the coding step is [V0, V1, V2, V3, V4, V5, V6, V7]. Suppose l=r=4 and the trigger index j=3 corresponds to the word “attend” which has the event type “Meet”. The result of this step is the vector [Z; V0; V1; V2; V3; V4; V5; V6; V7]T.
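A sketch of this windowing with zero-padding is shown below, reproducing the example above (l = r = 4, trigger index j = 3); the encoded dimension is an assumed value.

    import torch

    def context_window(V, j, l, r):
        # Stack the encoded vectors around trigger index j, padding
        # out-of-range positions with the zero vector Z.
        n, dim = V.shape
        Z = torch.zeros(dim)
        rows = [V[i] if 0 <= i < n else Z for i in range(j - l, j + r + 1)]
        return torch.stack(rows)  # shape: (l + r + 1, dim)

    V = torch.randn(8, 784)                   # 8 words, assumed k + m = 784
    V_out = context_window(V, j=3, l=4, r=4)  # rows: [Z; V0; V1; ...; V7]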
In this step, the model ingests the vector Vout and generates one of three realis statuses: “ACTUAL”, “GENERIC” and “OTHER”. In particular, the term “ACTUAL” refers to events that have occurred, are currently happening, or are about to take place, and that are specifically confirmed in the document; the term “GENERIC” refers to events without a specific participant, location, or time; the remaining events, such as believed events, hypothetical events, and negated events, are annotated as “OTHER”.
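A plausible sketch of this realis classifier follows: a feed-forward network over the flattened window vector Vout, where the hidden size h2 is an assumption.

    import torch.nn as nn

    class RealisClassifier(nn.Module):
        # Assumed architecture: one hidden layer of h2 neurons over V_out.
        def __init__(self, window_len, dim, h2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(window_len * dim, h2),
                nn.ReLU(),
                nn.Linear(h2, 3),  # ACTUAL, GENERIC, OTHER
            )

        def forward(self, V_out):               # V_out: (window_len, dim)
            return self.net(V_out.reshape(-1))  # realis logits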
In general, the proposed method uses four deep learning models: a word encoder and an embedding layer in step 2, as well as two neural networks in steps 3 and 5. The encoder's weights have already been trained, so they only need to be initialized from a checkpoint and then fine-tuned. However, the embedding layer and the neural networks must be trained from scratch. All of the above models are trained together with the aggregate loss function L calculated as in formula (5), where α is the coefficient controlling the influence of each component loss function (0<α<1).
L=αL1+(1−α)L2 (5)
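In code, the aggregate loss of formula (5) can be sketched as follows; the logits and label tensors are assumed outputs of the networks in steps 3 and 5, and α = 0.5 is an assumed value.

    import torch.nn.functional as F

    alpha = 0.5  # assumed value, 0 < alpha < 1
    L1 = F.cross_entropy(trigger_logits, trigger_labels)  # trigger loss, formula (3)
    L2 = F.cross_entropy(realis_logits, realis_labels)    # realis loss
    loss = alpha * L1 + (1 - alpha) * L2                  # formula (5)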
The training approach is based on the backpropagation algorithm combined with an optimization method. Backpropagation computes the gradient of the loss function with respect to each of the network's weights. The gradient value is then fed into the optimizer, which uses it to update the weights so as to minimize the loss function. The chosen optimizer is ADAM, a stochastic gradient descent method based on adaptive estimation of first- and second-order moments. Equation (6) shows how this optimizer updates the model weights, where gt is the gradient at step t, η is the learning rate, β1 and β2 are the moment decay rates, ε is a small constant, and m̂t, v̂t are the bias-corrected first and second moment estimates:

mt=β1·mt−1+(1−β1)·gt; vt=β2·vt−1+(1−β2)·gt²; θt=θt−1−η·m̂t/(√v̂t+ε) (6)
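A sketch of one joint training step with backpropagation and the Adam optimizer is given below; the learning rate and the model variable names are illustrative assumptions.

    import torch

    # All four models are optimized together (names are illustrative).
    params = (list(encoder.parameters()) + list(pos_embedding.parameters())
              + list(trigger_model.parameters()) + list(realis_model.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-5)  # assumed learning rate

    optimizer.zero_grad()
    loss.backward()    # backpropagation: gradient of L for every weight
    optimizer.step()   # Adam weight update of formula (6)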
The training process also includes tuning hyperparameters to achieve the best results. In this invention, these parameters consist of: the embedding vector dimension m; the numbers of neurons h1 and h2 in the hidden layers of steps 3 and 5, respectively; the numbers of neighboring vectors l and r; and the coefficient α. First, a list of candidate values is created for each hyperparameter. Due to the large search space, the random search method is preferred over the grid search approach during tuning: instead of evaluating every point in the grid, this method evaluates only a randomly selected subset, as sketched below. At each point, training and evaluation are performed on the cross-validation dataset. Finally, the best set of hyperparameters is selected to ensure the system's quality.
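The random search can be sketched as follows; the candidate value lists, the number of trials, and the evaluation helper are illustrative assumptions.

    import random

    # Assumed candidate values for each hyperparameter.
    space = {
        "m":     [8, 16, 32, 64],
        "h1":    [128, 256, 512],
        "h2":    [128, 256, 512],
        "l":     [2, 3, 4, 5],   # r is set equal to l
        "alpha": [0.3, 0.4, 0.5, 0.6, 0.7],
    }
    trials = [{k: random.choice(v) for k, v in space.items()} for _ in range(20)]
    # evaluate_cv is a hypothetical helper returning the cross-validation score.
    best = max(trials, key=evaluate_cv)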
The output of the proposed solution is described in the following three texts and their corresponding results:
The trigger, its event type, and realis in this text are, respectively:
In this text, the trigger, its event type, and realis are as follows:
The following are the trigger, event type, and realis in this text, respectively:
Number | Date | Country | Kind |
---|---|---|---
1-2022-05577 | Aug 2022 | VN | national |