The entire disclosure of Japanese Patent Application No. 2017-253028, filed on Dec. 28, 2017, is incorporated herein by reference in its entirety.
The present invention relates to a sentence scoring device and program capable of weighting a document.
There is a method of text mining as a method of extracting useful information from text (sentences). According to this method, for example, a word and the like having a negative meaning, such as “defect”, can be extracted from text and put together. By reading the extracted part, only useful information in a document can be checked easily without reading the whole document.
As to how to determine a sentence to be extracted in a document, there is, for example, a prior art of a method that divides a sentence into words, and weights the entire sentence by using a degree of importance (weighting value) of each of the words.
JP 2009-128967 A discloses a method of determining a noun and a predicate in a document, and weighting each noun based on an expressed content of a predicate with respect to the noun. In this method, when a predicate with respect to a specific noun is a predicate of a concept expressing a state change, a first weighting value is set to the noun. When the predicate expresses a concept of existence or non-existence and is affirmative, a second weighting value is set to the noun. When the predicate expresses a concept of existence or non-existence and is negative, a third weighting value is set to the noun.
For example,
When a sentence is weighted, there is a case where factors other than a content of the sentence are preferably considered.
However, the method described in JP 2009-128967 A and conventional methods perform weighting by setting the same degree of importance to the documents A and B, since such methods perform weighting based only on a content of a document, and do not support weighting in consideration of other external factors, such as a situation of a matter described in a document.
To solve the above problem, an object of the present invention is to provide a sentence scoring device and a program thereof that can perform weighting in consideration of a situation of a matter shown by a sentence.
To achieve the abovementioned object, according to an aspect of the present invention, a sentence scoring device reflecting one aspect of the present invention comprises a hardware processor that: extracts a sentence from a document; identifies a matter shown by the sentence; acquires a continuing period of the identified matter; derives a first weighting value of the sentence based on the acquired continuing period; extracts a keyword included in the sentence; derives a second weighting value of the sentence based on the extracted keyword; and determines a weighting value of the sentence based on the first weighting value and the second weighting value.
The advantages and features provided by one or more embodiments of the invention will become more fully understood from the detailed description given hereinbelow and the appended drawings which are given by way of illustration only, and thus are not intended as a definition of the limits of the present invention;
Hereinafter, one or more embodiments of the present invention will be described with reference to the drawings. However, the scope of the invention is not limited to the disclosed embodiments.
The PC 5 is a terminal device, such as a personal computer, used by the user. The PC 5 includes a central processing unit (CPU), a read only memory (ROM), a random access memory (RAM), and the like, and operates based on an operating system (OS) and a variety of programs, such as an application program. In the embodiment of the present invention, the PC 5 creates and stores a document, and inputs a document into the server 10 and requests scoring of a sentence in the input document.
Upon input of a document from the PC 5 and receiving a request for scoring a sentence in the document, the server 10 extracts a sentence from the document and performs scoring. In the scoring according to the embodiment of the present invention, a matter shown by an extracted sentence is identified first, and after a continuing period of the matter is acquired, a first weighting value of the sentence is derived based on the acquired continuing period. Next, after a keyword included in a sentence is extracted, a second weighting value of the sentence is derived based on the extracted keyword. A final weighting value of a sentence is determined based on the first weighting value and the second weighting value. The method of identifying a matter, the method of calculating a continuing period of the matter, and the like will be described later.
As described above, when performing scoring for one sentence, the server 10 performs scoring in consideration of not only a content of the sentence but also a continuing period of a matter shown by the sentence. For example, when a content of a sentence relates to solving a problem, and a continuing period of a matter shown by the sentence (a target problem) is long, it is expected that the problem that occurred has not been solved yet and is prolonged. Accordingly, the degree of importance is preferably set to be high in view of difficulty of solving the problem. In contrast, when the continuing period of the matter shown by the sentence is short, there is a high possibility that the problem can be solved easily, and the need for setting a high degree of importance is low. In this manner, scoring can be performed more in accordance with the actual situation as compared with a case where scoring is performed based only on a content of a sentence.
The CPU 11 executes middleware, an application program, and the like based on an OS program. The ROM 12 and the hard disk device 15 store a variety of programs, and the CPU 11 executes a variety of types of processing in accordance with the programs, so that functions of the server 10 are performed.
The RAM 13 is used, for example, as a work memory that temporarily stores a variety of types of data when the CPU 11 executes processing based on a program and an image memory that stores image data.
The non-volatile memory 14 is a memory (flash memory) whose stored content is not destroyed even when power is turned off, and is used for storing a variety of types of setting information and the like. The hard disk device 15 is a large-capacity and non-volatile storage device, and stores image data and the like as well as a variety of types of programs and data. In the embodiment of the present invention, the hard disk device 15 stores a document input by the PC 5, a history of a scored document, keywords and weighting values of keywords, and the like.
The network communication part 16 performs a function of communicating with the PC 5 and other external devices through the network 3.
In the embodiment of the present invention, the CPU 11 plays a role of a sentence extractor 30 that extracts a sentence from a document, a matter identifier 31 that identifies a matter shown by a sentence, a continuing period acquirer 32 that acquires a continuing period of a matter, a first weighting value derivation part 33 that derives a first weighting value of a sentence based on the acquired continuing period, an extractor 34 that extracts a keyword included in a sentence, a second weighting value derivation part 35 that derives a second weighting value of the sentence based on the extracted keyword, a weighting value determiner 36 that determines a weighting value of a sentence based on the first weighting value and the second weighting value, and a third weighting value derivation part 37 that derives a third weighting value corresponding to an identification item to which a sentence is connected.
In the embodiment of the present invention, the server 10 first extracts a sentence from a document, and then performs scoring of the sentence based on a content of the sentence. In this case, scoring is performed based on a keyword included in a sentence, a title related to the sentence, and the like. After that, a weighting value based on a continuing period of a matter shown by the sentence is used to calculate a final weighting value (final score) of the sentence. Processing performed until calculation of a final score will be described.
First, a method of extracting a sentence from a document will be described.
A document 100 of
When the document is divided at each punctuation mark and new line, the following sentences 1 to 11 can be extracted:
The server 10 analyzes a structure of the document 100 when extracting a sentence from the document 100. A method of analyzing a document structure may be any method. The embodiment of the present invention analyzes which of a chapter, a section, a paragraph, main text, and the like each sentence corresponds to based on, for example, how an indent and a serial number are attached, and a layer structure of the sentences.
Next, the server 10 detects a keyword and a title to be extracted that are related to scoring of each sentence. In the embodiment of the present invention, a character string which is a keyword and a title to be extracted is registered in the server 10 in advance. When the registered character string is in a sentence, the character string is detected. A weighting value is set to each registered character string in advance, and the weighting value is used to calculate a weighting value of a sentence.
In the embodiment of the present invention, a keyword may be in an influential relationship with other keywords. There is a keyword (keyword (influencing) in the diagram) that influences a succeeding keyword and a keyword (keyword (influenced) in the diagram) that is influenced by a preceding keyword.
In
“paper wrinkle”→1
“fixing”→1
“cost”→3
“occur”→3
“occurs frequently”→5
“failure”→5
“Theme A”→2
“Theme B”→1.5
“Theme C”→1.1
“market”→2
“product development”→1.5
“technology development”→1.1
Next, a method of scoring a sentence based on a keyword and a title will be described. In the embodiment of the present invention, the server 10 performs scoring only for a sentence that includes both the keyword (influencing) and the keyword (influenced).
In the embodiment of the present invention, when scoring of a sentence is performed, a weighting value corresponding to a title of the layer to which the sentence relates, or of a higher layer, is used for scoring of the sentence. A calculation formula in this case is
“(weighting value of keyword (influencing)+weighting value of keyword (influenced))×weighting value of title (theme name)×weighting value of title (phase)”.
However, the calculation formula used at the time of scoring is not limited to the above, and may be other calculation formulas.
Sentence 6 includes the keyword (influencing) “paper wrinkle” and the keyword (influenced) “occurs frequently”, and titles of layers higher than or equal to a layer on which Sentence 6 is positioned are “Theme A” and “market”. When weighting values corresponding to these character strings are substituted into the above calculation formula, the score of “24” is obtained. By a similar method, the score of “13.5” is calculated from Sentence 9, and the score of “18” is calculated from Sentence 11.
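The calculation for Sentence 6 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `keyword_title_score`, the substring matching, the example sentence text, and the use of the largest weighting value when several keywords of one role appear are all assumptions. Only the weighting values and the roles of “paper wrinkle” and “occurs frequently” come from the description above.

```python
from typing import Optional

# Weighting values from the list above. Only the two keywords whose roles
# are stated for Sentence 6 are registered here.
INFLUENCING = {"paper wrinkle": 1.0}
INFLUENCED = {"occurs frequently": 5.0}
THEME_WEIGHTS = {"Theme A": 2.0, "Theme B": 1.5, "Theme C": 1.1}
PHASE_WEIGHTS = {
    "market": 2.0,
    "product development": 1.5,
    "technology development": 1.1,
}


def keyword_title_score(sentence: str, theme: str, phase: str) -> Optional[float]:
    """Return (influencing + influenced) x theme x phase, or None when the
    sentence lacks either keyword role and is therefore not scored."""
    text = sentence.lower()
    influencing = [w for k, w in INFLUENCING.items() if k in text]
    influenced = [w for k, w in INFLUENCED.items() if k in text]
    if not influencing or not influenced:
        return None
    # Assumption: when several keywords of one role match, use the largest.
    keyword_part = max(influencing) + max(influenced)
    return keyword_part * THEME_WEIGHTS[theme] * PHASE_WEIGHTS[phase]


# Hypothetical sentence text; (1 + 5) x 2 x 2 = 24 as for Sentence 6.
print(keyword_title_score("Paper wrinkle occurs frequently.", "Theme A", "market"))
```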
In this case, a value obtained by adding the largest of the weighting values of the individual extracted themes (Theme A, Theme B, and Theme C) to the average of the remaining weighting values excluding the largest value is used as a weighting value representing the titles of the themes. In this example, since Theme A>Theme B>Theme C, the following equation is obtained:
Theme A+(Theme B+Theme C)÷2=2+(1.5+1.1)÷2=3.3
The calculated value 3.3 is used as a weighting value representing the theme names to perform scoring of the sentence. The embodiment of the present invention handles the case in the above manner. However, the method of handling the case where a plurality of titles is included on the same layer is not limited to the above.
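The representative weighting value for a plurality of titles on the same layer can be sketched as below. The function name `representative_weight` and the handling of a single title (returning its weighting value unchanged) are assumptions made for illustration.

```python
import math
from typing import Sequence


def representative_weight(weights: Sequence[float]) -> float:
    """Largest weighting value plus the average of the remaining ones."""
    if not weights:
        raise ValueError("at least one title weighting value is required")
    ordered = sorted(weights, reverse=True)
    largest, rest = ordered[0], ordered[1:]
    if not rest:
        # Assumption: a single title is represented by its own value.
        return largest
    return largest + sum(rest) / len(rest)


# Theme A, Theme B, Theme C: 2 + (1.5 + 1.1) / 2, approximately 3.3
value = representative_weight([2.0, 1.5, 1.1])
print(math.isclose(value, 3.3))
```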
In
In
As described above, a type of a title to be used for scoring may be determined in advance, or a title of a layer of a sentence to be scored, or a title on one layer higher than that of the sentence may be determined to be used.
When scoring based on a keyword and a title is completed for one sentence, a matter shown by the sentence is identified, and a continuing period of the matter is acquired. A weighting value corresponding to the acquired continuing period is used to calculate a final weighting value (final score) of the sentence. First, an identifying method of a matter will be described.
When performing scoring based on a keyword and a title, the server 10 registers a combination of a keyword and a title used for the scoring and a variety of types of information relating to the sentence as a scoring history in association with date and time of creation of the scored sentence. The scoring history plays a role as a history of creation of a sentence in the present invention. A variety of types of information relating to a sentence is assumed to be a department name in this example. In the server 10, a matter shown by a sentence is identified based on a combination of the registered keyword, theme, phase, and department name.
A department name and date and time in the scoring history 110 are acquired from a header, a footer, a character string in a specific area in a document, property of a document, a file name, file information, and the like. A department name, and date and time may be acquired by other methods. For example, when a sentence is extracted from the document 100 of
Consider a case where a continuing period is acquired for a matter shown by a certain sentence. First, when there is a record in a scoring history in which all “keyword”, “title (theme name, phase, or the like)”, and “department name” match with those in a sentence to be scored, the sentence indicated by the record and the sentence to be scored are determined as sentences relating to a common matter. Accordingly, a temporal difference between date and time of an oldest one of records relating to a matter that matches with that shown by a sentence to be scored and date and time of creation of the sentence to be scored is extracted, and the extracted temporal difference is used as a continuing period of a matter shown by the sentence to be scored.
In the embodiment of the present invention, a record is determined as that for a sentence showing a matter common to a sentence to be scored only when a combination of all of “keyword”, “title (theme name, phase, or the like)”, and “department name” completely matches. However, the configuration may be such that a record is determined as that for a sentence showing a common matter when part of the combination matches (for example, “keyword” and “title” match).
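The matching rule and the continuing period described above can be sketched as follows. The `HistoryRecord` layout, the field names, the department name, and the dates are hypothetical; the sketch only assumes that each record carries the keyword combination, theme name, phase, department name, and date and time of creation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class HistoryRecord:
    keywords: Tuple[str, ...]
    theme: str
    phase: str
    department: str
    created: datetime


def matter_key(record: HistoryRecord) -> tuple:
    """A matter is identified by the full combination (complete match)."""
    return (record.keywords, record.theme, record.phase, record.department)


def continuing_period(target: HistoryRecord,
                      history: Iterable[HistoryRecord]) -> Optional[timedelta]:
    """Difference between the oldest matching record and the target's creation.

    Returns None when no record in the history shows the same matter.
    """
    same_matter = [r for r in history if matter_key(r) == matter_key(target)]
    if not same_matter:
        return None
    oldest = min(r.created for r in same_matter)
    return target.created - oldest


kw = ("paper wrinkle", "occurs frequently")
history = [
    HistoryRecord(kw, "Theme A", "market", "Dept. X", datetime(2017, 1, 20)),
    HistoryRecord(kw, "Theme A", "market", "Dept. X", datetime(2017, 2, 20)),
    HistoryRecord(kw, "Theme B", "market", "Dept. X", datetime(2016, 12, 1)),
]
target = HistoryRecord(kw, "Theme A", "market", "Dept. X", datetime(2017, 4, 21))
print(continuing_period(target, history))
```

A partial match, as mentioned above, would correspond to building `matter_key` from fewer fields.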
In the embodiment of the present invention, a weighting value corresponding to a continuing period is set in advance.
In
For a sentence relating to a matter having a continuing period, a weighting value corresponding to the continuing period is multiplied by a score calculated based on a keyword and a title, so that a final score is calculated. In
Next, a case where a matter that has once been completed in the past occurs again will be described. First, the server 10 sets and stores in advance expressions for distinguishing whether or not a matter shown by a sentence is completed, such as character strings of “completed”, “has been”, and “closed”. When an expression indicating completion is detected in a sentence during scoring of the sentence, the matter shown by the sentence is registered in association with a fact that the matter has been completed.
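Detection of a completion expression can be sketched with a simple substring check. The function name `indicates_completion` and the case-insensitive matching are assumptions; the expressions are the three examples given above.

```python
# Expressions registered in advance (the examples given in the description).
COMPLETION_EXPRESSIONS = ("completed", "has been", "closed")


def indicates_completion(sentence: str) -> bool:
    """True when the sentence contains any registered completion expression."""
    text = sentence.lower()
    return any(expression in text for expression in COMPLETION_EXPRESSIONS)


print(indicates_completion("The countermeasure for paper wrinkle is completed."))
```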
Next, a method of acquiring a continuing period of a matter in consideration of a record of “has been completed” described above will be described.
In
In
Next, a case where scoring is performed in consideration of the number of times of recurrence of a matter will be described. When a record of a sentence that shows a matter common to that shown by a sentence to be scored and shows that the matter has been completed is registered in a scoring history, the number of records showing that the matter has been completed is assumed to be the number of times of recurrence of the matter, and the final score is multiplied by a coefficient corresponding to the number of times of recurrence at the time of calculation of the final score.
When the number of records showing the matter has been completed is one, the number of times of recurrence is one. When the number of records showing the matter has been completed is two, the number of times of recurrence is two.
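Counting recurrences and applying the coefficient can be sketched as follows. The coefficient values in `RECURRENCE_COEFFICIENTS` are hypothetical placeholders, not the values of the embodiment, and reusing the largest coefficient for counts beyond the table is likewise an assumption.

```python
from typing import Iterable

# Hypothetical coefficients keyed by the number of times of recurrence.
RECURRENCE_COEFFICIENTS = {0: 1.0, 1: 1.5, 2: 2.0}


def recurrence_count(completed_flags: Iterable[bool]) -> int:
    """Number of records of the matter that show it has been completed."""
    return sum(1 for completed in completed_flags if completed)


def apply_recurrence(final_score: float, count: int) -> float:
    """Multiply the final score by the coefficient for the recurrence count."""
    # Assumption: counts beyond the table reuse the largest coefficient.
    fallback = max(RECURRENCE_COEFFICIENTS.values())
    return final_score * RECURRENCE_COEFFICIENTS.get(count, fallback)


# One completed record among the matching records means one recurrence.
count = recurrence_count([False, True, False])
print(count, apply_recurrence(24.0, count))
```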
For example, when the sentence relating to the record of “2017/04/21” of
The server 10 performs scoring for a sentence and calculates a final score in the manner described above. Since scoring is performed in consideration of not only a keyword in a sentence, but also a title of a layer higher than or equal to a layer on which the sentence is positioned, a continuing period of a matter shown by the sentence, the number of times of recurrence, and the like, scoring that more reflects an actual situation can be performed as compared with a case where scoring is performed only based on a keyword in a sentence.
Next, a flow of processing performed by the server 10 according to the embodiment of the present invention will be described.
First, in Step S101 of
Next, whether or not there is a title of a type determined in advance, such as “theme name”, in a title of a layer higher than or equal to a layer on which a sentence is positioned is checked (Step S104). When there is not a title of a type determined in advance (Step S104; No), the processing proceeds to Step S108. When there is a title of a type determined in advance (Step S104; Yes), a weighting value set to the title in advance is acquired (Step S105).
When a single title is detected in Step S104 (Step S106; No), the processing proceeds to Step S108. When a plurality of titles arranged in parallel are detected in Step S104 (Step S106; Yes), a weighting value representing the titles is calculated by the method described in
In Step S108, scoring based on a keyword and a title is performed by the calculation method described in
When a matter shown by a sentence is registered in a scoring history, the matter may be registered in association with other pieces of information, such as a department name, as an element that identifies the matter as described in
In Step S201 of
When records of a common matter are extracted (Step S201; Yes), whether or not there is a record showing that the matter has been completed among the records is checked (Step S202).
When there is a record showing that the matter has been completed (Step S202; Yes), a record prior to the record showing that the matter has been completed is excluded (Step S203), and the processing proceeds to Step S204. When there is not a record showing that the matter has been completed (Step S202; No), the processing proceeds to Step S204.
In Step S204, a record of oldest date and time is extracted from extracted records. When a record prior to the record showing that the matter has been completed is excluded in Step S203, a record of oldest date and time is extracted from the remaining records. After that, a temporal difference between date and time of the extracted record and the present is calculated (Step S205), and a weighting value of a continuing period of a matter shown by a sentence to be scored is acquired from the calculation result (Step S206).
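Steps S202 to S206 can be sketched as one function over the records already extracted for a common matter. Several points are assumptions made for illustration: the thresholds and weighting values in `PERIOD_WEIGHTS` are hypothetical, the completion record itself is dropped together with the earlier records, and `None` is returned when no usable record remains.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical (minimum period, weighting value) pairs, longest first.
PERIOD_WEIGHTS = [(timedelta(days=90), 2.0), (timedelta(days=30), 1.5)]
DEFAULT_WEIGHT = 1.0


def continuing_period_weight(records: List[Tuple[datetime, bool]],
                             now: datetime) -> Optional[float]:
    """records: (date and time of creation, completed flag) of one matter."""
    ordered = sorted(records, key=lambda r: r[0])
    completed = [i for i, (_, done) in enumerate(ordered) if done]
    if completed:
        # S203: exclude records up to the latest completion record
        # (assumption: the completion record itself is excluded as well).
        ordered = ordered[completed[-1] + 1:]
    if not ordered:
        return None
    oldest = ordered[0][0]                    # S204
    period = now - oldest                     # S205
    for minimum, weight in PERIOD_WEIGHTS:    # S206
        if period >= minimum:
            return weight
    return DEFAULT_WEIGHT


def final_score(score: float, weight: Optional[float]) -> float:
    """Multiply the keyword/title score by the continuing-period weight."""
    # Assumption: without a continuing period the score is left unchanged.
    return score if weight is None else score * weight


now = datetime(2017, 5, 25)
recurred = [(datetime(2017, 1, 20), False),
            (datetime(2017, 2, 20), True),
            (datetime(2017, 4, 21), False)]
print(final_score(24.0, continuing_period_weight(recurred, now)))
```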
After the above, a final score is calculated by the method described in
In Step S104 of the flowchart of
When there is a record showing that a matter has been completed (Step S301; Yes), a weighting value (coefficient) corresponding to the number of records showing that a matter has been completed (the number of times of recurrence) is acquired (Step S302), the final score calculated in Step S207 is multiplied by the weighting value to calculate a final score again (Step S303), and the present processing is finished.
The processing of
The embodiment of the present invention has been described above with reference to the drawings. However, a specific configuration is not limited to the embodiment, and a change or an addition within a range not deviating from the gist of the present invention is also included in the present invention.
In the embodiment of the present invention, the server 10 plays a role as a sentence scoring device of the present invention. However, the sentence scoring device is not limited to the above. For example, other devices, such as the PC 5 and an MFP, may play a role as the sentence scoring device.
A method of extracting a sentence from a document and a method of extracting a keyword, a title, and the like are not limited to those described in the embodiment of the present invention. A keyword, a title, and the like are not limited to those described in the present invention. A calculation formula used for scoring is not limited to the one described in the embodiment. In the embodiment of the present invention, weighting values (coefficients) of a keyword, a title, a continuing period, the number of times of recurrence, and the like are set in advance. However, the weighting values may be changeable by the user.
The method of acquiring a continuing period is not limited to the method described in the embodiment of the present invention. For example, the continuing period may be acquired by a method, such as inquiring another server and the like in which a situation of a matter shown by a sentence is recorded. The method of identifying a matter is not limited to the method described in the embodiment of the invention. A matter may be identified by using or combining keywords other than a keyword relating to scoring, or a matter may be identified by a combination of elements of part of a keyword and a theme used for scoring.
In the embodiment of the present invention, scoring of a sentence is performed by using a weighting value of a title of a layer higher than or equal to a layer on which the sentence is positioned. However, scoring of the sentence may be performed only based on a keyword and a continuing period of a matter shown by the sentence.
In the embodiment of the present invention, types of a title of a layer higher than or equal to a layer on which a sentence is positioned are “theme name”, “phase”, and the like. However, the types of a title may be “product name”, “project name”, “negotiation name”, “department name”, “information of person in charge”, “date of creation”, and the like. The type of a title only needs to include any one of them.
A creation history of a sentence different from a scoring history may also be used to acquire a continuing period of a matter shown by a sentence. This creation history is preferably a database with which a document created in the past, a creation date of a sentence, and a matter may be identified.
In the embodiment of the present invention, a weighting value is larger as a continuing period is longer. Alternatively, a weighting value may be larger as a continuing period is shorter. The configuration may also be such that, while a continuing period is shorter than a predetermined period, a weighting value is made larger as the continuing period becomes longer, and when the continuing period exceeds a predetermined period, a weighting value is made smaller as the continuing period becomes longer (that is, a weighting value is lowered when a continuing period is constantly long). A relationship between a continuing period and a weighting value may also be such that a weighting value is rapidly changed as the continuing period exceeds a certain period, and may be set optionally.
Although embodiments of the present invention have been described and illustrated in detail, the disclosed embodiments are made for purposes of illustration and example only and not limitation. The scope of the present invention should be interpreted by terms of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2017-253028 | Dec 2017 | JP | national |