This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-085172, filed on May 14, 2020, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a change detecting technique.
In recent years, for example, in order to identify a topic included in document data such as minutes of a meeting or the like (hereinafter, also referred to as target document data), an information processing system for detecting sentences related to the same topic has been constructed.
Specifically, the information processing system calculates a similarity of contents of each sentence included in the target document data by using, for example, statistical information about the appearance frequency of each word in other document data (hereinafter, also referred to as training document data). Then, by using the calculated similarity, the information processing system distributes the sentences included in the target document data to multiple clusters, such that multiple sentences which may be determined to have similar contents are distributed to the same cluster. Further, the information processing system outputs, for example, a determination result that one or more sentences distributed to the same cluster are related to the same topic.
Related technologies are disclosed in, for example, Japanese Laid-Open Patent Publication No. 2015-225134, Japanese Laid-Open Patent Publication No. 2007-241902, and Japanese Laid-Open Patent Publication No. 2004-185135.
According to an aspect of the embodiment, a non-transitory computer-readable recording medium has stored therein a program that causes a computer to execute a process, the process including: calculating, based on words included in each of a plurality of sentences included in a target document, a plurality of vectors that respectively correspond to the plurality of sentences; executing a frequency analysis based on the plurality of vectors and a time axis associated with the plurality of vectors according to a writing order of the plurality of sentences in the target document; and outputting information that indicates a position that corresponds to a change point identified based on a result of the frequency analysis, in the target document.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
When distributing the sentences included in the target document data to multiple clusters, the information processing system makes a determination in consideration of a relationship of each sentence with its previous and subsequent sentences (hereinafter, also simply referred to as the context).
However, the range of sentences whose context needs to be considered varies according to the presence/absence of noise included in the target document data (sentences unrelated to a topic that corresponds to each cluster). Thus, when distributing the sentences included in the target document data to multiple clusters, the information processing system needs to make a determination while changing the range of sentences whose context needs to be considered. As a result, the information processing system may require a relatively long time for detecting sentences related to the same topic in the target document data.
<Configuration of Information Processing System>
First, a configuration of an information processing system 10 will be described.
The operation terminal 3 is a terminal with which, for example, an operator inputs necessary information or the like, and may be a PC (personal computer). Further, the operation terminal 3 is capable of communicating with the change detecting device 1 via a network NW.
The change detecting device 1 is configured by, for example, one or more physical or virtual machines, and performs a process of detecting a change point of a topic in target document data (hereinafter, also referred to as a change detecting process).
Specifically, for example, by using statistical information about the appearance frequency of each word in training document data, the change detecting device 1 calculates vector values that correspond to the sentences included in the target document data, respectively. Then, by using a similarity of the calculated vector values, the change detecting device 1 distributes the sentences included in the target document data to multiple clusters, such that multiple sentences which may be determined to have similar contents are distributed to the same cluster. Then, the change detecting device 1 outputs, for example, a determination result that one or more sentences distributed to the same cluster are related to the same topic, to the operation terminal 3. Hereinafter, a specific example of the process performed in the change detecting device 1 will be described.
<Specific Example of Process Performed in Change Detecting Device>
Specifically, as illustrated in
Then, the change detecting device 1 distributes each of the multiple vector values that correspond to the respective sentences included in the target document data, to multiple clusters based on the inter-vector distance in the graph represented in
Specifically, for example, the change detecting device 1 distributes each of the multiple vector values that correspond to the respective sentences included in the target document data, to multiple clusters, such that vectors relatively close to each other on the plane represented in
Here, when the respective sentences included in the target document data are distributed to clusters by using the vector values of the sentences as described above, the change detecting device 1 may not accurately distribute the sentences to the clusters.
Accordingly, for example, as illustrated in
Then, the change detecting device 1 distributes the respective sentences included in the target document data to clusters, based on the changing state of values in each of the time-series data generated in
Here, in order to distribute the sentences to clusters as described above, time-series data that exhibits a rough change may be used. Thus, when distributing the respective sentences included in the target document data to multiple clusters, the change detecting device 1 makes a determination in consideration of a relationship of each sentence with its previous and subsequent sentences.
In this regard, the range of sentences whose context needs to be considered varies according to the presence/absence of noise included in the target document data. Specifically, the range varies according to the presence/absence of noise or the like caused by, for example, the method of writing the target document data, or personal characteristics such as a manner of speaking when the contents written in the target document data are spoken. Further, the range varies according to the presence/absence of noise or the like caused by, for example, a difference in domain (contents) between the target document data and the training document data.
Thus, when distributing the respective sentences included in the target document data to multiple clusters, the change detecting device 1 needs to make a determination while changing the range of sentences whose context needs to be considered.
Specifically, in this case, as illustrated in
More specifically, for example, as illustrated in
Further, for example, as illustrated in
Then, in the examples illustrated in
Thus, in this case, the change detecting device 1 distributes the respective sentences included in the target document data to clusters by using, for example, the time-series data (
However, as described above, when the distribution to clusters is performed while changing the range of sentences whose context needs to be considered, a relatively long time may be required according to the presence/absence of noise included in the target document data. Thus, the change detecting device 1 may require a relatively long time for detecting sentences related to the same topic in the target document data.
Thus, the change detecting device 1 according to the present embodiment calculates multiple vector values that correspond to the multiple sentences included in the target document data, respectively (hereinafter, also simply referred to as vectors), based on words included in each of the multiple sentences. Then, the change detecting device 1 performs a frequency analysis based on the multiple vector values, and the time axis associated with the multiple vector values according to the writing order of the multiple sentences in the target document data. Thereafter, the change detecting device 1 outputs information indicating a position that corresponds to a change point identified based on the result of the frequency analysis, in the target document data.
That is, for example, the change detecting device 1 performs the frequency analysis on the multiple vector values that correspond to the multiple sentences included in the target document data, respectively (hereinafter, also referred to as pre-extraction vector values), so as to detect a rough change for the pre-extraction vector values. Then, based on the detected rough change, the change detecting device 1 detects a portion of the target document data that is related to the same topic.
Specifically, for example, the change detecting device 1 expresses the pre-extraction vector values as time-series data according to the writing order in the target document data, and extracts low-frequency components in the time-series data. Here, the low-frequency components refer to frequency components that correspond to a frequency equal to or lower than a predetermined threshold value, and correspond to, for example, about 10% of the frequency components that correspond to the time-series data from the lowest frequency component. Then, the change detecting device 1 identifies the multiple vector values that correspond to the extracted low-frequency components (hereinafter, also referred to as post-extraction vector values), as vector values that exhibit a rough change for the pre-extraction vector values.
Then, the change detecting device 1 distributes each of the identified post-extraction vector values to multiple clusters, based on the similarity relationship thereof. Further, the change detecting device 1 identifies, for example, a set of sentences that correspond to vector values included in different clusters, among sets of sentences of which writing positions are adjacent to each other in the target document data, and detects the position between the sentences included in the identified set of sentences as a change point of the topic.
As a result, the change detecting device 1 may identify one or more sentences related to the same topic in the target document data, without considering the relationship thereof with previous and subsequent sentences included in the target document data. Thus, the change detecting device 1 may identify one or more sentences related to the same topic in the target document data at a relatively high speed.
Meanwhile, the frequency that corresponds to the low-frequency components described above is, for example, about 0 Hz to about 0.1 Hz in a case where each sentence included in the target document data is regarded as a unit of one second on the time axis.
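The relation between the frequency range and the sentence positions may be sketched as follows. This is only an illustrative sketch, assuming one sentence per second and a discrete Fourier transform over "k" sentences; the function names `bin_frequencies_hz` and `low_frequency_bins` and the default cut-off of 0.1 Hz are hypothetical and are chosen here for illustration.

```python
def bin_frequencies_hz(k, seconds_per_sentence=1.0):
    """Frequency in Hz represented by each bin of a length-k discrete Fourier transform."""
    freqs = []
    for j in range(k):
        # Bins above k/2 mirror the lower bins (they carry the negative frequencies).
        cycles = j if j <= k // 2 else k - j
        freqs.append(cycles / (k * seconds_per_sentence))
    return freqs


def low_frequency_bins(k, cutoff_hz=0.1, seconds_per_sentence=1.0):
    """Indices of the bins whose frequency does not exceed the cut-off."""
    tolerance = 1e-12  # guards against floating-point rounding at the boundary
    freqs = bin_frequencies_hz(k, seconds_per_sentence)
    return [j for j, f in enumerate(freqs) if f <= cutoff_hz + tolerance]


print(low_frequency_bins(20))
```

For a document of 20 sentences, the bins at 0 Hz, 0.05 Hz, and 0.1 Hz are retained together with their mirrored counterparts.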
<Hardware Configuration of Information Processing System>
Next, a hardware configuration of the information processing system 10 will be described.
As illustrated in
The storage medium 104 has a program storage area (not illustrated) where, for example, a program 110 for performing the change detecting process is to be stored. Further, the storage medium 104 has an information storage area 130 where, for example, information used when the change detecting process is performed is to be stored. Meanwhile, the storage medium 104 may be, for example, an HDD (hard disk drive) or an SSD (solid state drive).
The CPU 101 executes the program 110 loaded from the storage medium 104 into the memory 102 to perform the change detecting process.
Further, the communication device 103 communicates with the operation terminal 3 via, for example, a network NW.
<Function of Information Processing System>
Next, the function of the information processing system 10 will be described.
As illustrated in
Further, for example, as illustrated in
The information reception unit 111 receives the machine learning model 131 input by, for example, an operator via the operation terminal 3. The machine learning model 131 is a function calculated by using statistical information 131a about the appearance frequency of each word in the training document data (not illustrated). Further, the information reception unit 111 receives the document data 132 input by, for example, an operator via the operation terminal 3.
The information management unit 112 stores, for example, the machine learning model 131 received by the information reception unit 111 in the information storage area 130. Further, the information management unit 112 stores, for example, the document data 132 received by the information reception unit 111 in the information storage area 130.
Based on words included in each of the multiple sentences included in the document data 132 received by the information reception unit 111, the vector calculation unit 113 calculates multiple vector values 133 that correspond to the multiple sentences, respectively.
Specifically, the vector calculation unit 113 inputs the document data 132 stored in the information storage area 130 to the machine learning model 131 stored in the information storage area 130, so as to calculate the vector values that correspond to the multiple sentences included in the document data 132.
The analysis execution unit 114 performs the frequency analysis based on the multiple vector values 133 calculated by the vector calculation unit 113, and the time axis associated with the multiple vector values 133 according to the writing order of the multiple sentences in the document data 132.
Specifically, the analysis execution unit 114 performs, for example, a Fourier transform on the time-axis data of the multiple vector values 133 associated with the time axis (hereinafter, also referred to as first waveform data), so as to acquire frequency components that correspond to the vector values 133. Then, the analysis execution unit 114 extracts, for example, specific frequency components from the acquired frequency components. Then, the analysis execution unit 114 performs, for example, an inverse Fourier transform on the extracted specific frequency components, so as to acquire time-series data of the multiple vector values 133 associated with the time axis (hereinafter, also referred to as second waveform data).
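The transform, extraction, and inverse transform described above may be sketched, for one series of vector values, as follows. This is a minimal sketch under stated assumptions and not the implementation of the embodiment: a naive discrete Fourier transform from the Python standard library is used in place of a fast Fourier transform, and the names `dft`, `inverse_dft`, `low_pass`, and the parameter `keep` are hypothetical.

```python
import cmath


def dft(series):
    """Naive discrete Fourier transform of a real-valued series."""
    k = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * j * t / k) for t, x in enumerate(series))
        for j in range(k)
    ]


def inverse_dft(components):
    """Inverse transform; only the real part is returned."""
    k = len(components)
    return [
        sum(c * cmath.exp(2j * cmath.pi * j * t / k) for j, c in enumerate(components)).real / k
        for t in range(k)
    ]


def low_pass(first_waveform, keep):
    """Keep the `keep` lowest frequencies (and their mirrored bins) and zero the rest."""
    k = len(first_waveform)
    components = dft(first_waveform)
    filtered = [c if min(j, k - j) < keep else 0 for j, c in enumerate(components)]
    return inverse_dft(filtered)


# Keeping only the zero-frequency component leaves the mean of the series.
print(low_pass([1.0, 3.0, 1.0, 3.0], keep=1))
```

The naive transform above takes time proportional to the square of the number of sentences; a fast Fourier transform computes the same components in time proportional to k log k.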
The cluster generation unit 115 distributes the multiple vector values 133 that correspond to the second waveform data, to multiple clusters CL by using, for example, the mutual similarity of the multiple vector values 133 that correspond to the second waveform data.
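The distribution to the clusters CL may be sketched as follows. The clustering method is not specified above; this sketch assumes a simple k-means procedure with a deterministic starting point, and the names `squared_distance` and `k_means` and the parameters `num_clusters` and `iterations` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, num_clusters, iterations=20):
    """Assign each vector to one of num_clusters clusters and return the labels."""
    # Deterministic start: take initial centers at evenly spaced positions in the sequence.
    step = len(vectors) / num_clusters
    centers = [list(vectors[int(i * step)]) for i in range(num_clusters)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest center.
        labels = [
            min(range(num_clusters), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Move each center to the mean of its members.
        for c in range(num_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vectors = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.1], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
print(k_means(vectors, num_clusters=2))
```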
For example, for each of the multiple clusters CL to which the vector values 133 are distributed by the cluster generation unit 115, the change point identifying unit 116 identifies the writing positions of the multiple sentences that correspond to the multiple vector values 133 distributed to each cluster CL, in the document data 132. Then, the change point identifying unit 116 identifies, for example, a set of sentences that correspond to the vector values 133 included in different clusters CL, among the sets of sentences of which writing positions are adjacent to each other in the document data 132. Thereafter, the change point identifying unit 116 identifies, for example, the position between the sentences included in the identified set, as a position that corresponds to a change point (a change point of the topic).
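The identification of change points from the cluster assignment may be sketched as follows. This sketch assumes that the cluster of each sentence is given as a list of labels in the writing order; the name `change_points` is hypothetical.

```python
def change_points(labels):
    """Return each index i such that sentence i and sentence i + 1 belong to different clusters."""
    return [i for i in range(len(labels) - 1) if labels[i] != labels[i + 1]]


# Sentences 0 to 2 are in cluster 0, sentences 3 and 4 in cluster 1, sentence 5 in cluster 2.
print(change_points([0, 0, 0, 1, 1, 2]))
```

Each returned index denotes the position between a sentence and the sentence that follows it.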
For example, the information output unit 117 outputs information indicating the position identified by the change point identifying unit 116 to the operation terminal 3 as information indicating a position that corresponds to a change point.
<Outline of First Embodiment>
Next, the outline of the first embodiment will be described.
As illustrated in
Then, when the timing for detecting a change comes (YES in S11), the change detecting device 1 calculates the multiple vector values 133 that correspond to the multiple sentences included in the document data 132, respectively, based on the words included in each of the multiple sentences (S12).
Subsequently, the change detecting device 1 performs the frequency analysis based on the multiple vector values 133 calculated in the process of S12, and the time axis associated with the multiple vector values 133 according to the writing order of the multiple sentences in the document data 132 (S13).
Then, the change detecting device 1 outputs information indicating the position that corresponds to a change point identified based on the result of the frequency analysis performed in the process of S13, in the document data 132 (S14).
That is, the change detecting device 1 performs, for example, the frequency analysis on the multiple vector values 133 that correspond to the multiple sentences included in the document data 132, respectively, so as to detect a rough change for the multiple vector values 133. Then, based on the detected rough change, the change detecting device 1 detects a portion of the document data 132 that is related to the same topic.
As a result, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132, without considering the relationship thereof with previous and subsequent sentences included in the document data 132. Thus, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132 at a high speed.
<Details of First Embodiment>
Next, the details of the first embodiment will be described.
<Information Managing Process>
First, in the change detecting process, a process of managing the machine learning model 131 (hereinafter, also referred to as an information managing process) will be described.
As illustrated in
Then, when the machine learning model 131 input by the operator via the operation terminal 3 is received (YES in S21), the information management unit 112 of the change detecting device 1 stores the received machine learning model 131 in the information storage area 130 (S22).
Meanwhile, for example, the information management unit 112 may be configured to generate the machine learning model 131 in its own device (the change detecting device 1) by machine learning based on training document data (not illustrated). In this case, for example, the information management unit 112 may generate the machine learning model 131 by machine learning based on training document data that are similar in contents to the document data 132.
<Main Process of Change Detecting Process>
Next, the main process of the change detecting process will be described.
As illustrated in
<Specific Example of Document Data>
The document data 132 represented in
Further, the document data 132 represented in
Further, the document data 132 represented in
Referring back to
Specifically, when the input of each of the “k” number of sentences included in the document data 132 is received, the machine learning model 131 extracts nouns included in each of the “k” number of sentences. Then, the machine learning model 131 calculates the vector sequences 133a that correspond to the “k” number of sentences, respectively, by using the nouns extracted from each of the “k” number of sentences, and the statistical information 131a generated in advance as a result of the machine learning based on training document data (not illustrated). Thereafter, the machine learning model 131 outputs, for example, the “k” number of calculated vector sequences 133a. Hereinafter, a specific example of the statistical information 131a and a specific example of the process of S32 will be described.
<Specific Example of Statistical Information>
First, a specific example of the statistical information 131a will be described.
The statistical information 131a represented in
Specifically, for the information recorded in the first line of the statistical information 131a represented in
Further, for the information recorded in the second line of the statistical information 131a represented in
<Specific Example of Process of S32>
Next, a specific example of the process of S32 will be described.
For example, when the input of the document data 132 described with reference to
Then, the machine learning model 131 calculates the average value of the first weight values that correspond to, for example, the extracted words “baseball,” “Olympic,” “representative,” “players,” and “announcement,” respectively, as the first vector value 133 that corresponds to the sentence 132a. Further, the machine learning model 131 calculates the average value of the second weight values that correspond to, for example, the extracted words “baseball,” “Olympic,” “representative,” “players,” and “announcement,” respectively, as the second vector value 133 that corresponds to the sentence 132a.
Specifically, in the statistical information 131a described with reference to
Thus, for example, as represented in the first line of
Further, the machine learning model 131 calculates the vector values 133 that correspond to the other sentences including, for example, the sentences 132b, 132c, 132d, 132e, and 132f, respectively.
Specifically, for example, as represented in the second line of
Thereafter, the machine learning model 131 outputs the “k” number of vector sequences 133a that correspond to the “k” number of sentences including the sentence 132a and others.
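The averaging of weight values for one sentence may be sketched as follows. The weight values below are hypothetical and are not taken from the statistical information 131a; the name `sentence_vector` is likewise hypothetical, and skipping words that have no recorded weights is an assumption of this sketch.

```python
def sentence_vector(words, word_weights):
    """Average, per dimension, the weight values of the words extracted from one sentence."""
    # Assumption: words without recorded weights are skipped.
    known = [word_weights[w] for w in words if w in word_weights]
    if not known:
        raise ValueError("none of the extracted words has recorded weight values")
    dims = len(known[0])
    return [sum(weights[d] for weights in known) / len(known) for d in range(dims)]


# Hypothetical first and second weight values for two words.
weights = {"baseball": [0.8, 0.2], "players": [0.6, 0.4]}
print(sentence_vector(["baseball", "players", "unlisted"], weights))
```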
Referring back to
Then, the analysis execution unit 114 extracts an i-th element in each of the “k” number of vector sequences 133a acquired in the process of S32 (S34).
Specifically, the analysis execution unit 114 extracts, for example, the first vector values 133 (the “k” number of vector values 133) included in the “k” number of vector sequences 133a described with reference to
Subsequently, the analysis execution unit 114 generates the vector sequences 133b configured by the “k” number of elements extracted in the process of S34 (S35).
Thereafter, the analysis execution unit 114 generates first waveform data WD1 that corresponds to the vector sequences 133b generated in the process of S35, according to the writing order of the “k” number of sentences in the document data 132 received in the process of S31 (S36).
That is, the analysis execution unit 114 generates time-series data of the vector values 133 in a case where each sentence in the document data 132 is written in a time-series order.
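The extraction of the i-th element from each vector sequence, arranged in the writing order, may be sketched as follows. The numerical values and the name `element_series` are hypothetical; "i" is counted from 1 as in the description above.

```python
def element_series(vector_sequences, i):
    """Collect the i-th element (1-based) of every sentence vector, in writing order."""
    return [sequence[i - 1] for sequence in vector_sequences]


# Three sentences, each with two vector values (hypothetical numbers).
sequences = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
print(element_series(sequences, 1))
print(element_series(sequences, 2))
```

The position in the returned list serves as the time axis, so that the list may be treated directly as time-series data.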
Specifically, for example, as illustrated in
Then, as illustrated in
Specifically, as illustrated in
Subsequently, the analysis execution unit 114 extracts specific frequency components FC from the frequency components FC acquired in the process of S41 (S42).
Specifically, for example, as illustrated in
Further, the analysis execution unit 114 performs the inverse Fourier transform on the frequency components extracted in the process of S42, so as to generate second waveform data WD2 that corresponds to the vector sequences 133b generated in the process of S35 (S43).
Specifically, as illustrated in
That is, the first waveform data WD1 generated by the process of S36 may include rough noise caused by a sentence whose topic cannot be identified (a sentence unrelated to a topic to be identified), a writing habit of a writer of the document data 132, or the like.
Further, for example, when the document data 132 received in the process of S31 is document data such as minutes of a meeting or the like, it may be determined that sentences corresponding to the same topic are collectively located in the document data 132.
Thus, for example, the analysis execution unit 114 may extract only the low-frequency components that correspond to the first waveform data WD1, and generate the second waveform data WD2 that corresponds to the extracted low-frequency components, so that it is possible to acquire waveform data that excludes the rough noise and expresses a rough change of the topic.
As a result, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132, without considering the relationship thereof with other sentences included in the document data 132. Thus, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132 at a high speed.
Thereafter, the analysis execution unit 114 determines whether “i” has reached “n,” which is the number of vector values 133 included in each of the vector sequences 133a acquired in the process of S32 (S44).
As a result, when it is determined that “i” has not reached “n” (NO in S44), the analysis execution unit 114 adds 1 to “i” (S45), and then, performs the process of S34 and subsequent processes.
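The repetition over “i” may be sketched as follows. This is only a sketch: the smoothing step is passed in as a hypothetical callable `smooth` that maps one time series to a series of the same length, and the name `smooth_each_dimension` is also hypothetical.

```python
def smooth_each_dimension(vector_sequences, smooth):
    """Apply `smooth` to the series of i-th elements for i = 1..n and rebuild one vector per sentence."""
    n = len(vector_sequences[0])  # number of vector values per sentence
    smoothed_columns = []
    for i in range(1, n + 1):
        series = [sequence[i - 1] for sequence in vector_sequences]
        smoothed_columns.append(smooth(series))
    # Regroup the smoothed series so that each sentence again has n values.
    return [list(row) for row in zip(*smoothed_columns)]


def series_mean(series):
    """Placeholder smoothing used only for this illustration: replace every value by the mean."""
    return [sum(series) / len(series)] * len(series)


print(smooth_each_dimension([[1.0, 10.0], [3.0, 20.0]], series_mean))
```

Because each dimension is processed independently, the total cost is “n” times the cost of processing one series.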
Meanwhile, when it is determined that “i” has reached “n” (YES in S44), the cluster generation unit 115 of the change detecting device 1 distributes the multiple vector values 133 that correspond to the second waveform data WD2 generated in the process of S43 to the multiple clusters CL, by using the similarity of each of the multiple vector values 133 that correspond to the second waveform data WD2 generated in the process of S43, as illustrated in
Specifically, for example, as illustrated in
Then, for each of the multiple clusters CL to which sentences are distributed in the process of S51, the change point identifying unit 116 of the change detecting device 1 identifies the writing positions of the multiple sentences that correspond to the multiple vector values 133 included in each cluster CL, in the document data 132 (S52).
Subsequently, the change point identifying unit 116 identifies a set of sentences that correspond to the vector values 133 included in different clusters CL, respectively, among the sets of sentences of which writing positions identified in the process of S52 are adjacent to each other (S53).
Thereafter, the information output unit 117 of the change detecting device 1 outputs information indicating the position between the sentences included in the set identified in the process of S53, as information indicating a position that corresponds to a change point of the topic in the document data 132 (S54).
Meanwhile, the analysis execution unit 114 may, for example, extract high-frequency components of the frequency components FC in the process of S42. In this case, the analysis execution unit 114 may detect a change point where the topic changes significantly, in the document data 132.
As described above, the change detecting device 1 of the present embodiment calculates the multiple vector values 133 that correspond to the multiple sentences included in the document data 132, respectively, based on the words included in each of the multiple sentences. Then, the change detecting device 1 executes the frequency analysis based on the multiple vector values 133 and the time axis associated with the multiple vector values 133 according to the writing order of the multiple sentences in the document data 132. Thereafter, the change detecting device 1 outputs information indicating a position that corresponds to the change point identified based on the result of the frequency analysis, in the document data 132.
That is, the change detecting device 1 performs the frequency analysis on the multiple vector values (pre-extraction vector values) that correspond to the multiple sentences included in the document data 132, respectively, so as to detect a rough change for the pre-extraction vector values. Then, based on the detected rough change, the change detecting device 1 detects a portion of the document data 132 that is related to the same topic.
Specifically, the change detecting device 1 expresses, for example, the pre-extraction vector values as time-series data according to the writing order in the document data 132, and extracts low-frequency components in the time-series data. Then, the change detecting device 1 identifies the multiple vector values (post-extraction vector values) that correspond to the extracted low-frequency components, as vector values that indicate a rough change for the pre-extraction vector values.
Thereafter, the change detecting device 1 distributes each of the identified post-extraction vector values to the multiple clusters based on the similarity relationship thereof. Further, the change detecting device 1 identifies, for example, a set of sentences that correspond to vector values included in different clusters, respectively, among the sets of sentences of which writing positions are adjacent to each other in the document data 132, and detects the position between the sentences included in the identified set of sentences, as a change point of the topic.
As a result, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132, without considering the relationship thereof with previous and subsequent sentences included in the document data 132. Thus, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132 at a high speed. Specifically, for example, the change detecting device 1 may identify one or more sentences related to the same topic in the document data 132 in a time that is quasi-linear in the number of sentences included in the document data 132.
According to an aspect of the embodiment, a topic may be identified in units of a sentence.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to an illustrating of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2020-085172 | May 2020 | JP | national |