The entire disclosure of Japanese Patent Application No. 2017-253009, filed on Dec. 28, 2017, is incorporated herein by reference in its entirety.
The present invention relates to a sentence scoring apparatus and a program capable of weighting documents.
Text mining is a method of extracting useful information from text (sentences). This method can be used, for example, to extract words having a negative meaning, such as “failure”, from the text and to group them. Reading the extracted text makes it easy to check only the useful information in the document without reading the entire document.
As a conventional technique of determining a sentence as an extraction target from a document, there is a method of dividing a sentence into words and weighting the entire sentence by using the importance (weight value) of each of the words.
Moreover, JP 2009-128967 A discloses a method of determining a noun and a predicate in a document and then performing weighting for each of the words on the basis of the expressed content of the predicate with respect to the noun. This method sets a first weight value when a predicate for a specific noun has a concept expressing a state change, sets a second weight value for a predicate expressing a concept of existence, and sets a third weight value when the predicate expresses a concept of existence in the negative.
For example,
Meanwhile, there is a case, in weighting sentences, where it is more preferable to consider factors other than the content of sentences.
However, when weighting a sentence, the method described in JP 2009-128967 A and the conventional method perform weighting simply on the basis of the content of the sentence and do not support weighting in view of other information. Accordingly, document A and document B are weighted, in their text, with the same importance.
The present invention is intended to solve the above problem, and an object is to provide a sentence scoring apparatus and a program capable of weighting a sentence in a document having a hierarchical structure in view of information other than the sentence.
To achieve the abovementioned object, according to an aspect of the present invention, a sentence scoring apparatus reflecting one aspect of the present invention comprises a hardware processor that: extracts a sentence from a document having a hierarchical structure; derives a first weight value corresponding to a title of a hierarchical layer above a hierarchical layer to which the sentence extracted by the hardware processor belongs; extracts a keyword included in the sentence; derives a second weight value of the sentence on the basis of the extracted keyword; and determines a weight value of the sentence on the basis of the first weight value and the second weight value.
The advantages and features provided by one or more embodiments of the invention will become more fully understood from the detailed description given hereinbelow and the appended drawings which are given by way of illustration only, and thus are not intended as a definition of the limits of the present invention:
Hereinafter, one or more embodiments of the present invention will be described with reference to the drawings. However, the scope of the invention is not limited to the disclosed embodiments.
The PC 5 is a terminal device such as a personal computer used by a user. The PC 5 includes a central processing unit (CPU), a read only memory (ROM), and a random access memory (RAM), and operates on the basis of various programs such as operating system (OS) and application programs. In an embodiment of the present invention, the PC 5 creates and saves a document, inputs a document to the server 10, and requests scoring of a sentence in the input document.
After receiving a document input from the PC 5 and a request for scoring a sentence in the document, the server 10 extracts the sentence from the document and performs scoring. The document to be input to the server 10 is assumed to be a document having a hierarchical structure having classification of a chapter, a section, a subsection, a text, or the like.
In the scoring in the embodiment of the present invention, a keyword is detected from a sentence and a second weight value corresponding to the keyword is derived. Furthermore, a first weight value is derived in accordance with the title of the hierarchical layer above the hierarchical layer to which the sentence belongs. Subsequently, the weight value of the sentence is determined on the basis of the first weight value and the second weight value. The title of the hierarchical layer to which the sentence belongs and the titles of the higher hierarchical layers are likely to include information related to the sentence, such as a theme name, an affiliated project name, and a phase, for example. Accordingly, by performing scoring not only in view of the sentences but also in view of this information, it is possible to achieve scoring that fits the actual situation.
In the embodiment of the present invention, scoring is also performed in view of the duration of a matter indicated by a sentence. In a case where the content of the sentence is related to solution of a problem and the duration of the matter (subject matter) indicated by the sentence is long, it is presumed that the current problem cannot be solved easily or quickly. In this case, it is desirable to give high importance to this sentence because of the difficulty in solving the problem. Conversely, if the duration of a matter indicated by a sentence is short, there is a high possibility that it can be easily solved. In this case, there is less necessity to give high importance to the sentence. Therefore, scoring can better reflect the actual situation than in a case where scoring is performed simply on the basis of character strings in the sentence.
The CPU 11 executes middleware, application programs or the like on the basis of an OS program. The ROM 12 and the hard disk device 15 store various programs. The CPU 11 executes various types of processing in accordance with these programs, thereby implementing each of the functions of the server 10.
The RAM 13 is used as a work memory that temporarily stores various data when the CPU 11 executes processing on the basis of the program, or as an image memory that stores image data.
The nonvolatile memory 14 is a memory (flash memory) that maintains stored content even when the power supply is turned off, and it is used for storing various types of setting information or the like. The hard disk device 15 is a large-capacity nonvolatile storage device, and stores various types of programs and data in addition to image data or the like. In the embodiment of the present invention, a document input from the PC 5, the scoring history, each of the keywords and its weight value, or the like, are stored.
The network communication unit 16 functions to communicate with the PC 5 and other external devices via the network 3.
In the embodiment of the present invention, the CPU 11 functions as a sentence extracting unit 30 that extracts a sentence from a document having a hierarchical structure, an extracting unit 34 that extracts a keyword included in a sentence, a second weight value deriving unit 35 that derives a second weight value on the basis of the extracted keyword, a first weight value deriving unit 33 that derives a first weight value according to a title of a hierarchical layer above a hierarchical layer to which a sentence belongs, and a weight value determination unit 36 that determines a weight value of the sentence on the basis of the first weight value and the second weight value.
Note that the CPU 11 also functions as a matter specifying unit 31 that specifies a matter indicated by a sentence, a duration acquisition unit 32 that acquires a duration of the matter, and a third weight value deriving unit 37 that derives a third weight value of the sentence on the basis of the acquired duration.
In the embodiment of the present invention, the server 10 first extracts a sentence from a document, and then performs scoring of the sentence on the basis of the content of the sentence. In this case, the scoring is performed by using keywords contained in the sentence and titles of the hierarchical layers above the hierarchical layer to which the sentence belongs. Thereafter, the final weight value (final score) of the sentence is calculated by using the weight value based on the duration of the matter indicated by the sentence. Each process performed for calculation of the final score will be described.
First, a method of extracting a sentence from a document having a hierarchical structure will be described.
A document 100 of
First product development department Creation date and time: Apr. 21, 2017
1. Theme A
2. Theme B
By dividing this document at each of punctuation marks and line feeds, it is possible to extract the following Sentences 1 to 11.
Sentence 1: First product development department Creation date and time: Apr. 21, 2017
Sentence 2: 1. Theme A
Sentence 3: 1-1 Product development
Sentence 4: Development completed
Sentence 5: 1-2 Market
Sentence 6: Frequent occurrence of paper wrinkle problem at customer ∘∘
Sentence 7: 2. Theme B
Sentence 8: 2-1 Technology development
Sentence 9: Partial incompleteness in fixation failure countermeasure and re-countermeasures are underway
Sentence 10: 2-2 Market
Sentence 11: Frequent occurrence of paper wrinkle problem in initial lot.
The server 10 analyzes the structure of the document when extracting sentences from the document 100. While any method may be used as a method of analyzing the document structure, the method in the embodiment of the present invention determines to which of a chapter, a section, a subsection, or text each of the sentences belongs and analyzes their hierarchical structures on the basis of the indentation, assignment method of serial numbers, or the like.
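As an illustration only, the structure analysis described above could be sketched as follows. The numbering patterns (“1.” for a chapter, “1-1” for a section) are assumptions taken from the example sentences, and the function name `analyze_structure` is hypothetical; the embodiment permits any analysis method.

```python
import re

# Assumed numbering conventions: "1. Title" is a chapter, "1-1 Title" is a section.
CHAPTER = re.compile(r"^\d+\.\s+(.*)$")
SECTION = re.compile(r"^\d+-\d+\s+(.*)$")


def analyze_structure(lines):
    """Return (text, titles_above) pairs for every body-text line."""
    chapter = section = None
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if m := CHAPTER.match(line):
            # A new chapter resets the current section.
            chapter, section = m.group(1), None
        elif m := SECTION.match(line):
            section = m.group(1)
        else:
            # Body text: record the titles of the layers above it.
            titles = [t for t in (chapter, section) if t is not None]
            result.append((line, titles))
    return result


sample = [
    "1. Theme A",
    "1-1 Product development",
    "Development completed",
    "1-2 Market",
    "Frequent occurrence of paper wrinkle problem at customer ∘∘",
]
for text, titles in analyze_structure(sample):
    print(titles, "->", text)
```

Each body-text sentence is thereby paired with the chapter and section titles above it, which is the information the later title-based weighting relies on.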
Next, the server 10 detects keywords and titles as extraction targets related to scoring in each of the sentences. In the embodiment of the present invention, the server 10 has preliminarily registered character strings to be the keywords and titles as extraction targets. In a case where the registered character string exists in the sentence, the server 10 detects the character string. A weight value is preliminarily set for each of the registered character strings, and the weight value is used for calculating the weight value of a sentence.
In the embodiment of the present invention, a keyword can have a modifying-modified relationship with another keyword, and thus, keywords are classified into a keyword as a subject (keyword (modifying) in the figure) of a succeeding keyword and a keyword as a predicate of the preceding keyword (keyword (modified) in the figure).
In
In
“Paper wrinkle”→1
“Fixation”→1
“Cost”→3
“Occurrence”→3
“Frequent occurrence”→5
“Failure”→5
“Theme A”→2
“Theme B”→1.5
“Theme C”→1.1
“Market”→2
“Product development”→1.5
“Technology development”→1.1
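The detection of registered character strings can be pictured with the following minimal sketch. The weight values are the example values listed above; the assignment of each keyword to the modifying or the modified group, the case-insensitive matching, and the longest-match rule (so that “frequent occurrence” wins over “occurrence”) are assumptions made for this illustration, and the name `detect` is hypothetical.

```python
# Example weight values from the description. The split into "modifying" and
# "modified" groups is an assumption made for this sketch.
MODIFYING = {"paper wrinkle": 1, "fixation": 1, "cost": 3}
MODIFIED = {"occurrence": 3, "frequent occurrence": 5, "failure": 5}


def detect(sentence, table):
    """Return (keyword, weight) for the longest registered string found, or None."""
    text = sentence.lower()
    hits = [kw for kw in table if kw in text]
    if not hits:
        return None
    # Assumed rule: prefer the longest match when one keyword contains another.
    best = max(hits, key=len)
    return best, table[best]


sentence = "Frequent occurrence of paper wrinkle problem in initial lot."
print(detect(sentence, MODIFYING))  # ('paper wrinkle', 1)
print(detect(sentence, MODIFIED))   # ('frequent occurrence', 5)
```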
Next, a method of scoring sentences on the basis of keywords and titles will be described. In the embodiment of the present invention, the server 10 selectively defines a sentence that contains both the keyword (modifying) and the keyword (modified) as a scoring target.
In the embodiment of the present invention, when scoring a sentence, a weight value corresponding to a title of a hierarchical layer above the hierarchical layer to which the sentence belongs is used for scoring the sentence. Although the calculation formula here is
“(weight value of keyword (modifying) + weight value of keyword (modified)) × weight value of title (theme name) × weight value of title (phase)”,
the calculation formula at the time of scoring is not limited to this, and other calculation formulas may be used.
Sentence 6 contains a keyword (modifying) being “paper wrinkle” and a keyword (modified) being “frequent occurrence”, and the titles of the hierarchical layers above the hierarchical layer at which sentence 6 is located are “theme A” and “market”. When the weight values corresponding to these character strings are applied to the above calculation formula, the score would be “24”. By using a similar method, the score of sentence 9 is calculated to be “13.5” and the score of sentence 11 is calculated to be “18”.
In such a case, first, an average value of the remaining weight values, excluding the maximum value among the individual weight values of the extracted themes (theme A, theme B, theme C), is calculated. Subsequently, this average value is added to the maximum value, and the result is adopted as a weight value representing these titles.
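The calculation for sentences 6 and 11 can be checked with a short sketch. The function name `score` and its parameter names are hypothetical; the weight values are the example values given in this description.

```python
def score(modifying, modified, theme, phase):
    """(modifying + modified) x theme weight x phase weight, per the example formula."""
    return (modifying + modified) * theme * phase


# Sentence 6: "paper wrinkle" (1), "frequent occurrence" (5), Theme A (2), Market (2)
print(score(1, 5, 2, 2))    # 24
# Sentence 11: "paper wrinkle" (1), "frequent occurrence" (5), Theme B (1.5), Market (2)
print(score(1, 5, 1.5, 2))  # 18.0
```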
In this example, the weight values have a relationship of theme A>theme B>theme C, and thus, the following expression is applicable.
Theme A+(theme B+theme C)/2=2+(1.5+1.1)/2=3.3.
The value 3.3 calculated here is used as a weight value representing the theme name to perform scoring of the sentences. While the embodiment of the present invention uses this approach, the method of handling a case where a plurality of titles are included in the same hierarchical layer is not limited thereto.
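A minimal sketch of this representative weight calculation follows. The function name `representative_weight` is hypothetical, and returning the single weight unchanged when only one title is present is an assumption.

```python
def representative_weight(weights):
    """Maximum weight plus the average of the remaining weights."""
    if not weights:
        raise ValueError("at least one title weight is required")
    ordered = sorted(weights, reverse=True)
    top, rest = ordered[0], ordered[1:]
    if not rest:
        # Assumption: a single title is represented by its own weight.
        return top
    return top + sum(rest) / len(rest)


# Theme A (2), Theme B (1.5), Theme C (1.1): 2 + (1.5 + 1.1) / 2
print(round(representative_weight([2, 1.5, 1.1]), 2))  # 3.3
```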
In
In
In this manner, the type of the title to be used for scoring may be determined beforehand, or the title of the hierarchical layer closer to the hierarchical layer to which the sentence belongs may be prioritized among the hierarchical layers above the hierarchical layer to which the sentence belongs. For example, when there is a title in the hierarchical layer to which the sentence belongs, a weight value corresponding to the title is derived. When there is no title, the presence or absence of a title in the hierarchical layer immediately above is examined. When there is a title there, a weight value corresponding to the title is derived. When there is no title, the presence or absence of a title in the next higher hierarchical layer is examined. In this manner, the title of the closest hierarchical layer among the hierarchical layers above the hierarchical layer to which the sentence belongs may be used for scoring.
Alternatively, in the case of performing scoring on the basis of titles of a plurality of hierarchical layers, it is allowable to total the weight value of the title of the closest hierarchical layer and the weight value of the title of the next closest hierarchical layer with respect to the hierarchical layer to which the sentence as a scoring target belongs, each multiplied by a weight corresponding to how close the layer is to the target layer (priority order).
After completion of scoring of a sentence by using the keywords and titles, the matter indicated by the sentence is specified, the duration of that matter is acquired, and then a final weight value (final score) of the sentence is calculated by using the weight value corresponding to the acquired duration. First, a method of identifying matters will be described.
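The search for the closest title can be sketched as a simple walk from the sentence's own layer upward. The function name `closest_title_weight`, the representation of the layers as a nearest-first list, and the neutral default of 1.0 when no registered title is found are assumptions for illustration.

```python
# Example title weights from the description.
TITLE_WEIGHTS = {
    "Theme A": 2, "Theme B": 1.5, "Theme C": 1.1,
    "Market": 2, "Product development": 1.5, "Technology development": 1.1,
}


def closest_title_weight(titles_nearest_first, default=1.0):
    """Return the weight of the first registered title, searching upward.

    titles_nearest_first lists the title of the sentence's own layer first,
    then each higher layer; None means the layer has no title.
    """
    for title in titles_nearest_first:
        if title is not None and title in TITLE_WEIGHTS:
            return TITLE_WEIGHTS[title]
    # Assumption: a neutral multiplier when no registered title exists.
    return default


print(closest_title_weight([None, "Market", "Theme A"]))  # 2
print(closest_title_weight([None, None, "Theme B"]))      # 1.5
```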
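One possible reading of this priority-weighted total is sketched below. The priority coefficients (1.0 for the closest layer, 0.5 for the next) are purely hypothetical values chosen for illustration, as is the function name `weighted_title_total`; the description does not specify them.

```python
def weighted_title_total(weights_nearest_first, priority_coefficients=(1.0, 0.5)):
    """Total the title weights, each scaled by a coefficient for its closeness.

    The coefficients are hypothetical; the closest layer gets the first one.
    """
    return sum(w * c for w, c in zip(weights_nearest_first, priority_coefficients))


# Closest title "Market" (2), next closest "Theme A" (2): 2 * 1.0 + 2 * 0.5
print(weighted_title_total([2, 2]))  # 3.0
```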
In a case where scoring is performed with a keyword or a title, the server 10 registers a combination of the keyword, the title, various types of information related to the sentence, or the like, used for the scoring as scoring history in association with the creation date and time of the scored sentence. The scoring history functions as a sentence creation history in the present invention. Various types of information related to the sentences are assumed to be the department name. The server 10 specifies the matters indicated by the sentences by using the combination of the registered keywords, themes, phases, and department names.
The department name and the date and time in the scoring history 110 are acquired from a header, a footer, character strings in a specific region in the document, the property of the document, the file name, the file information, or the like. Acquisition of these may be performed by other methods. For example, when a sentence is extracted from a document 100 of
To acquire the duration of a matter indicated by a certain sentence, the server 10 first examines whether the scoring history contains a record in which all of “keyword”, “title (theme name, phase, and the like)”, and “department name” match those of the sentence as a scoring target. When there is a matching record, it is judged that the sentence indicated by the record and the sentence as a scoring target are sentences related to a common matter. Accordingly, the temporal difference between the date and time of the oldest of the matching records and the creation date and time of the sentence as a scoring target is calculated, and this difference is defined as the duration of the matter indicated by the sentence as the scoring target.
In the embodiment of the present invention, a record is judged to be a record of a sentence indicating the matter common to the sentence as the scoring target only in a case where all of “keyword”, “title (theme name, phase, and the like)”, and “department name” match perfectly. However, it is also allowable to judge that it is a record of a sentence indicating the common matter in a case where at least a part of the combination matches (for example, in a case where the “keyword” and “title” match).
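The matching and duration acquisition can be illustrated with the following sketch. The record layout (dictionaries with `keyword`, `title`, `department`, and `date` keys), the function name `matter_duration`, and the earlier history date are assumptions; only the creation date of Apr. 21, 2017 comes from the example document.

```python
from datetime import date

MATCH_KEYS = ("keyword", "title", "department")


def matter_duration(history, target):
    """Return the time from the oldest matching record to the target, or None."""
    same = [r for r in history if all(r[k] == target[k] for k in MATCH_KEYS)]
    if not same:
        return None  # no earlier record of the same matter
    oldest = min(r["date"] for r in same)
    return target["date"] - oldest


matter = {
    "keyword": ("paper wrinkle", "frequent occurrence"),
    "title": ("Theme A", "Market"),
    "department": "First product development department",
}
history = [
    {**matter, "date": date(2017, 1, 20)},   # hypothetical earlier record
    {**matter, "department": "Other department", "date": date(2016, 6, 1)},
]
target = {**matter, "date": date(2017, 4, 21)}
print(matter_duration(history, target).days)  # 91
```

The second history record differs in department name and is therefore not treated as the same matter under the perfect-match rule.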
In the embodiment of the present invention, a weight value corresponding to the duration is preliminarily set
In
Regarding the sentence concerning a matter having a duration, a score calculated on the basis of a keyword or a title is multiplied by a weight value according to the duration so as to calculate a final score. In
Next, a case where a matter which has been completed once in the past occurs again will be described. First, the server 10 presets and saves character strings such as “completion”, “completed”, “closed”, and the like, for discriminating whether the matter indicated by the sentence is completed or not. When an expression indicating completion is detected in the sentence at the time of scoring the sentence, information indicating that the matter is completed is also registered to the scoring history at a registration of the matter indicated by the sentence.
Next, a method of acquiring the duration of a matter in view of the above-described “completed” record will be described.
In
In
Next, a case where scoring is performed in view of the number of times of recurrence of a matter will be described. In a case where the scoring history contains records of sentences indicating a matter common to the matter indicated by the sentence as a scoring target, and records indicating completion are registered among them, the number of completed records is regarded as the number of times of recurrence of the matter, and the score is multiplied by a coefficient corresponding to the number of times of recurrence at the time of calculating the final score.
When the number of completed records is one, the number of times of recurrence is set to once, and when the number of completed records is two, the number of times of recurrence is set to twice.
For example, since the same matter has already been completed once at the time of creating a sentence related to the record of “2017/04/21” in
In this manner, the server 10 performs scoring on the sentence and calculates the final score. Scoring is performed in view of not only keywords in the sentences but also the title of the hierarchical layer above the hierarchical layer at which the sentence is located, the duration of the matters indicated by the sentences, and the number of times of recurrence. Accordingly, it is possible to perform scoring to fit the actual situation compared with the case of performing scoring simply using the keywords in the sentence.
Next, a flow of processing performed by the server 10 according to the embodiment of the present invention will be described.
First, in step S101 of
Next, examination is performed as to whether there is a title of a predetermined type such as “theme name” among the titles of the hierarchical layers above the hierarchical layer at which the sentence is located (step S104). In a case where there is no title of a predetermined type (step S104; No), the processing proceeds to step S108. In a case where there is a title of a predetermined type (step S104; Yes), the weight value preset for the title is acquired (step S105).
In a case where the number of the titles detected in step S104 is singular (step S106; No), the processing proceeds to step S108. In a case where the plurality of titles is detected in step S104 in parallel (step S106; Yes), the weight values representing the plurality of titles are calculated by the method described in
In step S108, scoring is performed with the keywords and titles by using the calculation method described with reference to
When registering a matter indicated by a sentence in the scoring history, as described in
In step S201 of
After a record of a common matter is extracted (step S201; Yes), examination is made as to whether there is a completed record (step S202).
In a case where there is a completed record (step S202; Yes), the record before the completion is excluded (step S203), and the processing proceeds to step S204. In a case where there is no completed record (step S202; No), the processing proceeds to step S204.
In step S204, the record with the oldest date and time is extracted from the extracted records. In a case where the record before completion has been excluded in step S203, the record with the oldest date and time would be extracted from the remaining records. Thereafter, a temporal difference between the date and time of the extracted record and the present is calculated (step S205), and the weight value of the duration of the matter indicated by the sentence as a scoring target is acquired from the calculation result (step S206).
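Steps S201 to S206 can be pictured with the following sketch. The duration thresholds and weights in `DURATION_WEIGHTS` are hypothetical, as are the function name `duration_weight`, the neutral weight of 1.0 when no record remains, and the interpretation that records dated on or before the most recent completed record are the ones excluded in step S203.

```python
from datetime import date

# Hypothetical table: (minimum duration in days, weight), longest first.
DURATION_WEIGHTS = [(90, 1.5), (30, 1.2), (0, 1.0)]


def duration_weight(records, today):
    """records: (date, completed) pairs already matched to the target's matter."""
    if not records:                                    # S201: no common matter
        return 1.0
    completed = [d for d, done in records if done]     # S202
    if completed:
        last_done = max(completed)
        # S203 (assumed reading): drop records up to the latest completion.
        records = [(d, done) for d, done in records if d > last_done]
    if not records:
        return 1.0
    oldest = min(d for d, _ in records)                # S204
    days = (today - oldest).days                       # S205
    for threshold, weight in DURATION_WEIGHTS:         # S206
        if days >= threshold:
            return weight
    return 1.0


records = [
    (date(2016, 10, 1), False),
    (date(2016, 12, 1), True),    # matter completed once
    (date(2017, 3, 1), False),    # matter recurs
]
print(duration_weight(records, date(2017, 4, 21)))  # 1.2
```

With the completed record present, the duration is measured from the recurrence on 2017-03-01 (51 days) rather than from the first record in 2016.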
Thereafter, the final score is calculated from the score calculated in step S108 of
In addition, in step S104 of the flow of
In a case where there is a completed record (step S301; Yes), a weight value (coefficient) corresponding to the number of completed records (number of times of recurrence) is acquired (step S302), and then the final score calculated in step S207 is multiplied by the acquired weight value to re-calculate the final score (step S303), and the current processing is finished.
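The recurrence adjustment of steps S301 to S303 might look like the following sketch. The coefficient values, the rule of reusing the largest coefficient for counts beyond the table, and the name `apply_recurrence` are all hypothetical.

```python
# Hypothetical coefficients keyed by the number of times of recurrence.
RECURRENCE_COEFFICIENTS = {1: 1.5, 2: 2.0}


def apply_recurrence(final_score, completed_count):
    """Multiply the final score by the coefficient for the recurrence count."""
    if completed_count == 0:                           # S301: No
        return final_score
    # S302: assumed fallback to the largest coefficient for unlisted counts.
    coefficient = RECURRENCE_COEFFICIENTS.get(
        completed_count, max(RECURRENCE_COEFFICIENTS.values())
    )
    return final_score * coefficient                   # S303


print(apply_recurrence(24, 0))  # 24
print(apply_recurrence(24, 1))  # 36.0
```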
Note that the processing in
Although the embodiments of the present invention have been described with reference to the drawings, specific configurations are not limited to those illustrated in the embodiments, and modifications and additions within the scope not deviating from the spirit of the present invention are also to be included in the present invention.
In the embodiment of the present invention, the server 10 has functions as the sentence scoring apparatus of the present invention, but the sentence scoring apparatus is not limited thereto. For example, other devices such as the PC 5 or an MFP may serve as the sentence scoring apparatus.
The method of extracting sentences from documents and the method of extracting keywords, titles or the like are not limited to those described in the embodiment of the present invention. Moreover, keywords, titles or the like are not limited to those described in the present invention. The calculation formula for scoring is not limited to that described in the embodiment. While the embodiment of the present invention uses the preset weight values (coefficients) of the keyword, the title, the duration, the number of times of recurrence or the like, they may be changeable by the user.
The method of acquiring the duration is not limited to the method described in the embodiment of the present invention. For example, the duration may be acquired by querying another server or the like in which the situation of the matter indicated by the sentence is recorded. Further, the method of specifying the matter is not limited to the method described in the embodiment of the invention. A keyword other than the keyword related to the scoring may be used or a combination may be used to specify the matter, or a keyword or a theme used for scoring may partially be specified by a combination of elements.
In the embodiment of the present invention, scoring is performed in view of the duration of a matter indicated by a sentence. However, scoring of the sentence may be performed with the use of only the keyword and the title of the hierarchical layer above the hierarchical layer at which the sentence is located.
In the embodiment of the present invention, the type of the title of the hierarchical layer above the hierarchical layer at which the sentence is located is “theme name”, “phase”, or the like. However, it is allowable to use a “product name”, a “project name”, a “negotiation name”, a “department name”, “information of person in charge”, “creation date”, or the like. It suffices to include one of them.
A duration of a matter indicated by a sentence may be acquired using a sentence creation history different from the scoring history. This creation history may be any database as long as it can specify the creation date and matters of documents and sentences that have been created so far.
Although the embodiment of the present invention is a case where the longer the duration, the larger the weight value, it is allowable to configure such that the shorter the duration, the larger the weight value. Alternatively, the weight value may be increased as the duration becomes longer while the duration is less than a predetermined period, and the weight value may be decreased as the duration becomes longer in a case where the duration exceeds a predetermined period (that is, the weight value may be lowered in case of a prolonged and constant state). Furthermore, the relationship between the duration and the weight value may be set to any setting such that the weight value rapidly changes at a point after exceeding a certain period of time.
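As one illustration of the rise-then-fall variation mentioned above, the weight could be modeled as a piecewise linear function of the duration. Every parameter here (`peak_days`, `base`, `peak`, `decay`) and the function name `peaked_duration_weight` are hypothetical; the description leaves the actual relationship open.

```python
def peaked_duration_weight(days, peak_days=90, base=1.0, peak=2.0, decay=0.01):
    """Weight rises linearly until peak_days, then falls back toward base."""
    if days <= peak_days:
        return base + (peak - base) * days / peak_days
    # Past the predetermined period the weight decreases, but not below base.
    return max(base, peak - decay * (days - peak_days))


for d in (0, 45, 90, 140, 1000):
    print(d, round(peaked_duration_weight(d), 2))
```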
Although embodiments of the present invention have been described and illustrated in detail, the disclosed embodiments are made for purposes of illustration and example only and not limitation. The scope of the present invention should be interpreted by terms of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2017-253009 | Dec 2017 | JP | national |