This application is a National Stage application under 35 U.S.C. § 371 of International Application No. PCT/JP2019/034864, having an International Filing Date of Sep. 4, 2019, which claims priority to Japanese Application Serial No. 2018-174531, filed on Sep. 19, 2018. The disclosure of the prior application is considered part of the disclosure of this application, and is incorporated in its entirety into this application.
The present invention relates to a learning apparatus, an extraction apparatus, and a learning method.
Conventionally, in a software development process, test items in unit testing, integration testing, and multiple composite testing/stability testing are extracted manually by a skilled person based on a design specification generated in system design/basic design, functional design, and detailed design. In contrast to this, an extraction method of automatically extracting test items of a testing step from a design specification, which is often written in natural language, has been proposed (see PTL 1).
In this extraction method, training data obtained by tagging important description portions of a design specification written in natural language is prepared, and the trend of the tagged description portions is learned using a machine learning logic (e.g., CRF (Conditional Random Fields)). Then, in this extraction method, based on the learning result, a new design specification is tagged using a machine learning logic, and the test items are extracted in a mechanical manner from the tagged design specification.
In the conventional extraction method, an attempt was made to improve the accuracy of machine learning for extracting test items by preparing as many related natural language documents as possible and increasing the amount of training data. However, training data includes description portions that are unrelated to the tag, in addition to the description portions to be tagged. For this reason, in the conventional extraction method, there have been limitations on the improvement of the accuracy of machine learning since the probability calculation for the description portions that are unrelated to the tag is also reflected during learning of the training data. As a result, in the conventional extraction method, there have been cases in which it is difficult to efficiently extract test items from test data such as a design specification in a software development process.
The present invention was made in view of the foregoing circumstances, and aims to provide a learning apparatus, an extraction apparatus, and a learning method, according to which it is possible to efficiently learn tagged portions in a software development process.
In order to solve the above-described problems and achieve the object, a learning apparatus according to the present invention includes: a pre-processing unit configured to perform, on training data that is data described in natural language and in which a tag has been provided to an important description portion in advance, pre-processing for calculating an information gain that indicates a degree of relevance to the tag for each word and deleting a description portion with low relevance to the tag from the training data based on the information gain of each word; and a learning unit configured to learn the pre-processed training data and generate a list of conditional probabilities relating to the tagged description portion.
According to the present invention, it is possible to efficiently learn tagged portions in a software development process.
Hereinafter, one embodiment of the present invention will be described in detail with reference to the drawings. Note that the present invention is not limited by the embodiment. Also, identical portions are denoted by identical reference numerals in the description of the drawings.
Regarding an extraction apparatus according to an embodiment, a schematic configuration of the extraction apparatus, a flow of processing of the extraction apparatus, and a specific example of the processing will be described.
Next, a configuration of the extraction apparatus 10 will be described.
The input unit 11 is an input interface for receiving various operations from an operator of the extraction apparatus 10. For example, the input unit 11 is constituted by an input device such as a touch panel, an audio input device, a keyboard, or a mouse.
The communication unit 12 is a communication interface for transmitting and receiving various types of information to and from another apparatus connected via a network or the like. The communication unit 12 is realized by an NIC (Network Interface Card) or the like, and performs communication between another apparatus and the control unit 14 (described later) via an electrical communication line such as a LAN (Local Area Network) or the Internet. For example, the communication unit 12 inputs training data De, which is data written in a natural language (e.g., a design specification) and in which important description portions have been tagged, to the control unit 14. Also, the communication unit 12 inputs the test data Da from which the test items are to be extracted to the control unit 14.
Note that the tag is, for example, Agent (Target system), Input (input information), Input condition (complementary information), Condition (Condition information of system), Output (output information), Output condition (complementary information), or Check point (check point).
The storage unit 13 is a storage apparatus such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. Note that the storage unit 13 may also be a data-rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 13 stores an OS (Operating System) and various programs to be executed by the extraction apparatus 10.
Furthermore, the storage unit 13 stores various types of information to be used in the execution of the programs. The storage unit 13 includes a conditional probability list 131 relating to the tagged description portions. The conditional probability list 131 is obtained by associating the type of the assigned tag and the assigned probability with the front-rear relationship and context of each word. The conditional probability list 131 is generated due to the description portions in which tags are present being statistically learned by the learning unit 142 (described later) based on the training data.
The control unit 14 performs overall control of the extraction apparatus 10. The control unit 14 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). Also, the control unit 14 includes an internal memory for storing programs and control data defining various processing procedures, and executes processing using the internal memory. Also, the control unit 14 functions as various processing units due to various programs operating. The control unit 14 includes a pre-processing unit 141, a learning unit 142, a tagging unit 143, and a test item extraction unit 144 (extraction unit).
The pre-processing unit 141 performs pre-processing for deleting description portions with low relevance to the tags from the input training data De. The pre-processing unit 141 deletes the description portions with low relevance to the tags from the training data De based on the information gain of each word in the training data De. The pre-processing unit 141 includes an information gain calculation unit 1411 and a deletion unit 1412.
The information gain calculation unit 1411 calculates, for each word, an information gain indicating the degree of relevance to the tag in the training data De. Based on the information gain of each word calculated by the information gain calculation unit 1411, the deletion unit 1412 obtains the description portions with low relevance to the tags and deletes them from the training data De.
The learning unit 142 learns the pre-processed training data and generates a conditional probability list for the tagged description portions.
The tagging unit 143 tags the description content of the test data based on the conditional probability list 131.
The test item extraction unit 144 mechanically extracts test items from the description content of the tagged test data.
The output unit 15 is realized by, for example, a display apparatus such as a liquid crystal display, a printing apparatus such as a printer, or an information communication apparatus. The output unit 15 outputs the test item data Di indicating the test items extracted by the test item extraction unit 144 from the test data Da to a testing apparatus or the like.
Next, learning processing in the processing performed by the extraction apparatus 10 will be described.
First, as shown in
For this reason, the learning unit 142 performs learning using the training data Dp in which portions that will adversely influence the probability calculation have been excluded, and therefore it is possible to perform probability calculation reflecting only the description portions with high relevance to the tags. As a result, compared to the case of learning the training data De as-is, the extraction apparatus 10 can improve the accuracy of machine learning and can generate a more accurate conditional probability list 131.
Next, testing processing in the processing performed by the extraction apparatus 10 will be described.
As shown in
Next, processing performed by the information gain calculation unit 1411 will be described. The information gain calculation unit 1411 calculates an information gain IG(i) using the following Formula (1).
Formula 1
IG(i)=H(m)−{P(Xi=1)H(m|Xi=1)+P(Xi=0)H(m|Xi=0)} (1)
In Formula (1), messages whose occurrence probabilities are (P1, P2, . . . , Pn) are denoted as (m1, m2, . . . , mn). Xi is a condition, where Xi=0 is within tags and Xi=1 is outside of tags. Also, entropy H(m) is indicated by Formula (2) below.
The first term on the right side of Formula (1) indicates the entropy of the occurrence of any word m in a sentence. P(m) indicates a probability that any word m will occur in a sentence. Also, the second term on the right side of Formula (1) indicates the entropy of co-occurrence of a premise Xi and a word m. P(Xi) indicates the probability of being within or outside of a tag, and H(m|Xi) indicates the entropy of the occurrence of any word m within or outside of a tag.
A large information gain can be said to reduce entropy. That is, a word with a large information gain is thought to have a high degree of relevance to the tag.
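Formula (2) is not reproduced in this text. Assuming it is the standard Shannon entropy over the occurrence probabilities (P1, P2, . . . , Pn) defined above, with a base-2 logarithm (the base is an assumption), the entropy and the conditional entropies appearing in Formula (1) could be written as follows. The summation index j is used here only to avoid a clash with the index i of the condition Xi.

```latex
H(m) = -\sum_{j=1}^{n} P_j \log_2 P_j
\qquad
H(m \mid X_i = x) = -\sum_{j=1}^{n} P(m_j \mid X_i = x)\,\log_2 P(m_j \mid X_i = x),
\quad x \in \{0, 1\}
```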
Next, an information gain calculation procedure will be described. First, a case will be described in which the information gain calculation unit 1411 calculates the entropy H(m) of a word m.
First, as first processing, the information gain calculation unit 1411 counts the total number X of words in the document. As an example of counting, text A obtained by morphologically analyzing a document is prepared, and the information gain calculation unit 1411 counts a word count G of the text A, which is used as the total number X.
Next, as second processing, the information gain calculation unit 1411 counts an appearance count Y of a word y in the document. As an example of counting, the appearance count Y in the text A is counted for the word y.
Then, as third processing, the information gain calculation unit 1411 calculates Pi using Formula (3) based on the numbers obtained in the first processing and the second processing.
As fourth processing, the information gain calculation unit 1411 calculates the entropy H(m) based on the result obtained in the third processing and based on Formula (2).
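The first to fourth processing can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes that Formula (3) is the ratio Y/X and that Formula (2) is the Shannon entropy with a base-2 logarithm, since neither formula is reproduced in this text. The function names and the sample token list are hypothetical.

```python
import math
from collections import Counter


def word_probabilities(tokens):
    """Return P = Y / X for every distinct word (assumed form of Formula (3))."""
    total = len(tokens)        # first processing: total word count X
    counts = Counter(tokens)   # second processing: appearance count Y per word
    return {word: count / total for word, count in counts.items()}


def entropy(probabilities):
    """Return -sum(P * log2(P)) (assumed form of Formula (2))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical output of morphological analysis (text A).
tokens = ["system", "sends", "request", "system",
          "returns", "response", "system", "logs"]
probs = word_probabilities(tokens)   # third processing
h = entropy(probs.values())          # fourth processing
```

Zero probabilities are skipped in the sum, following the usual convention that 0 log 0 is treated as 0.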
Next, a case will be described in which the information gain calculation unit 1411 calculates the entropy H(m|Xi) of the word m during the condition Xi.
First, as fifth processing, the information gain calculation unit 1411 counts the appearance count Z of the word m within the tags (Xi=0). As an example of counting, the text A and a text B obtained by extracting only tagged rows from the text A are prepared, and the information gain calculation unit 1411 counts the word count W of the text B and counts the appearance count Z in the text B for the word m in the text A.
Here, the conditional probability P(m|Xi) is indicated as in Formula (4).
Then, P(Xi=0) in Formula (4) is indicated by Formula (5), and P(m∩Xi) is indicated by Formula (6).
Accordingly, Formula (4) is indicated as in Formula (7).
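Formulas (4) to (7) are not reproduced in this text. A reconstruction that is consistent with the counts defined above (the total word count X of the text A, the word count W of the text B, and the appearance count Z of the word m in the text B) is sketched below. The exact form of the original formulas may differ.

```latex
\begin{aligned}
P(m \mid X_i) &= \frac{P(m \cap X_i)}{P(X_i)} && \text{(4)}\\
P(X_i = 0) &= \frac{W}{X} && \text{(5)}\\
P(m \cap X_i = 0) &= \frac{Z}{X} && \text{(6)}\\
P(m \mid X_i = 0) &= \frac{Z / X}{W / X} = \frac{Z}{W} && \text{(7)}
\end{aligned}
```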
As sixth processing, the information gain calculation unit 1411 calculates the entropy H(m|Xi) based on Formula (2) and P(m|Xi=0) obtained by applying the counted W and Z to Formula (7). Then, the information gain calculation unit 1411 applies the calculation result of the fourth processing and the calculation result of the sixth processing to Formula (1) to obtain the information gain IG(i).
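The overall calculation of Formula (1) for a single word can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the implementation of the embodiment: the entropy of a word is taken as the binary entropy of the event "a token is the word m", the within-tag probability is Z/W, and the outside-tag probability is assumed to be (Y−Z)/(X−W), which the text does not state explicitly. The function names and the parallel list `in_tag` are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (base 2) of an event that occurs with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def information_gain(word, tokens, in_tag):
    """Information gain of `word` with respect to the within/outside-tag condition.

    tokens: words of text A; in_tag: parallel booleans, True for Xi=0 (within tags).
    """
    x_total = len(tokens)                                  # X
    y_word = sum(1 for t in tokens if t == word)           # Y
    w_tagged = sum(1 for flag in in_tag if flag)           # W
    z_word_tagged = sum(1 for t, flag in zip(tokens, in_tag)
                        if flag and t == word)             # Z
    outside = x_total - w_tagged

    h_m = binary_entropy(y_word / x_total)                 # H(m)
    p_in = w_tagged / x_total                              # P(Xi=0)
    p_out = outside / x_total                              # P(Xi=1)
    h_in = binary_entropy(z_word_tagged / w_tagged) if w_tagged else 0.0
    h_out = binary_entropy((y_word - z_word_tagged) / outside) if outside else 0.0
    return h_m - (p_in * h_in + p_out * h_out)


# Hypothetical data: the first four tokens are within tags.
tokens = ["request", "the", "request", "the", "note", "the", "note", "the"]
in_tag = [True, True, True, True, False, False, False, False]
ig_request = information_gain("request", tokens, in_tag)
ig_the = information_gain("the", tokens, in_tag)
```

In this example, "the" occurs at the same rate within and outside of tags, so its information gain is zero, whereas "request" occurs only within tags and receives a positive information gain.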
Next, processing performed by the deletion unit 1412 will be described. Based on the information gain of each word calculated by the information gain calculation unit 1411, the deletion unit 1412 obtains the description portions with low relevance to the tags and deletes them from the training data De.
Specifically, the deletion unit 1412 deletes words for which the information gain calculated by the information gain calculation unit 1411 is lower than a predetermined threshold value from the training data. For example, when the information gain calculation unit 1411 calculates the information gain for each word of the training data De (see (1) in
In the case of the training data De1 shown in
Also, the deletion unit 1412 determines whether or not to perform deletion in units of sentences based on the information gain calculated by the information gain calculation unit 1411 and the information gain of a predetermined part of speech in the sentence. Specifically, the deletion unit 1412 deletes sentences that do not include nouns for which the information gain calculated by the information gain calculation unit 1411 is higher than a predetermined threshold value from the training data.
Words with high information gains and words with low information gains are both included in the training data De. Also, words that are common among sentences, such as “desu” and “masu”, and technical terms are both included in the training data De in some cases. In view of this, the deletion unit 1412 considers nouns for which the information gain is higher than a predetermined threshold value to be technical terms, determines that sentences that do not include nouns for which the information gain is higher than a predetermined threshold have no relevance to the tag, and deletes those sentences.
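The word-level deletion and the noun-based sentence deletion described above can be sketched in Python as follows. This is a minimal sketch under assumptions: each sentence is represented as a list of (word, part of speech) pairs, the part-of-speech labels "noun", "verb", and "other" are hypothetical, words missing from the gain table are treated as having an information gain of 0, and the threshold values are arbitrary.

```python
def delete_low_gain_words(sentence, gains, threshold):
    """Word-level deletion: drop words whose information gain is below the threshold."""
    return [(w, pos) for w, pos in sentence if gains.get(w, 0.0) >= threshold]


def has_high_gain_noun(sentence, gains, threshold):
    """True if the sentence contains a noun whose information gain exceeds the threshold."""
    return any(pos == "noun" and gains.get(w, 0.0) > threshold
               for w, pos in sentence)


def delete_sentences_without_high_gain_noun(sentences, gains, threshold):
    """Sentence-level deletion: keep only sentences that contain a high-gain noun."""
    return [s for s in sentences if has_high_gain_noun(s, gains, threshold)]


# Hypothetical information gains and morphologically analyzed sentences.
gains = {"server": 0.30, "returns": 0.12, "response": 0.25,
         "this": 0.01, "is": 0.0, "fine": 0.02}
sentences = [
    [("server", "noun"), ("returns", "verb"), ("response", "noun")],
    [("this", "other"), ("is", "verb"), ("fine", "other")],
]
kept = delete_sentences_without_high_gain_noun(sentences, gains, threshold=0.1)
```

Here the second sentence contains no noun with an information gain above the threshold, so only the first sentence remains in `kept`.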
For example, in the case of training data De2 shown in
Also, the deletion unit 1412 determines whether or not to perform deletion in units of sentences based on the information gain calculated by the information gain calculation unit 1411 and whether or not there is a verb in the sentence. Specifically, the deletion unit 1412 deletes sentences that include nouns for which the information gain calculated by the information gain calculation unit 1411 is higher than a predetermined threshold value but do not include verbs from the training data.
Words with high information gains and words with low information gains are both included in the table of contents, titles, and the like in the training data De. Even if words with high information gains appear in the table of contents, titles, or initial phrases of sections, those words do not correspond to test items if there is no verb in the corresponding line. For this reason, the deletion unit 1412 determines that sentences that do not include verbs but include nouns for which the information gains calculated by the information gain calculation unit 1411 are higher than the predetermined threshold value are description portions that are not to be tagged, and deletes those sentences from the training data. The deletion unit 1412 also deletes lines including only words with low information gains. Although there is a high likelihood that words with high relevance to the tags will be present in the table of contents and the like, it is thought that those words will influence the CRF probability calculation in the original context, and therefore the influence on the accuracy of the machine learning logic such as CRF is removed by deleting such sentences.
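The verb-based rule and the deletion of lines containing only low-gain words can be sketched in Python as follows. This is an illustrative sketch under assumptions: each line is a list of (word, part of speech) pairs, the labels "noun", "verb", and "other" are hypothetical, and words missing from the gain table are treated as having an information gain of 0.

```python
def should_delete(line, gains, threshold):
    """Decide whether a line should be removed from the training data."""
    has_high_word = any(gains.get(w, 0.0) > threshold for w, _ in line)
    if not has_high_word:
        return True  # the line contains only words with low information gains
    has_high_noun = any(pos == "noun" and gains.get(w, 0.0) > threshold
                        for w, pos in line)
    has_verb = any(pos == "verb" for _, pos in line)
    # A high-gain noun without any verb is treated as a heading-like line.
    return has_high_noun and not has_verb


def filter_lines(lines, gains, threshold):
    """Keep only the lines that are not selected for deletion."""
    return [line for line in lines if not should_delete(line, gains, threshold)]


# Hypothetical information gains and lines.
gains = {"response": 0.25, "timeout": 0.20, "server": 0.30,
         "returns": 0.12, "overview": 0.01}
lines = [
    [("response", "noun"), ("timeout", "noun")],                       # heading-like
    [("server", "noun"), ("returns", "verb"), ("response", "noun")],   # sentence
    [("overview", "noun")],                                            # low gain only
]
remaining = filter_lines(lines, gains, threshold=0.1)
```

In this example, only the second line, which contains both a high-gain noun and a verb, remains in `remaining`.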
In the case of training data De3 in
Next, a processing procedure of learning processing in the processing performed by the extraction apparatus 10 will be described.
As shown in
A processing procedure of pre-processing (step S2) shown in
As shown in
Next, a processing procedure of testing processing in the processing performed by the extraction apparatus 10 will be described.
As shown in
In contrast to this, with the extraction apparatus 10 according to the present embodiment, before learning, pre-processing for deleting the description portions with low relevance to the tags from the training data De is performed on the training data De. Also, the learning unit 142 performs learning using the training data Dp in which portions that will adversely influence the probability calculation have been excluded, and therefore it is possible to perform probability calculation reflecting only the description portions with high relevance to the tags.
Also, with the extraction apparatus 10, as pre-processing, an information gain indicating the degree of relevance to the tags is calculated for each word in the training data De, description portions with low relevance to the tags are obtained based on the information gain of each word, and the obtained description portions are deleted from the training data De. In this manner, with the extraction apparatus 10, the degree of relevance between the tags and the words is quantitatively evaluated, and training data in which only the description portions with a high degree of relevance to the tags are left is suitably generated.
By learning the pre-processed training data, the extraction apparatus 10 can improve the accuracy of machine learning and can generate a highly-accurate conditional probability list 131 compared to the case of learning the training data as-is. That is, the extraction apparatus 10 can accurately learn the tagged portions in the software development process, and accompanying this, the test items can be efficiently extracted from the test data such as a design specification.
The constituent elements of the apparatuses shown in the drawings are functionally conceptual, and are not necessarily required to be physically constituted as shown in the drawings. That is, the specific modes of dispersion and integration of the apparatuses are not limited to those shown in the drawings, and all or a portion thereof can be functionally or physically dispersed or integrated in any unit according to various loads, use conditions, and the like. Furthermore, all or any portion of the processing functions performed by the apparatuses can be realized by a CPU and programs analyzed and executed by the CPU, or can be realized as hardware using wired logic.
Also, among the steps of processing described in the present embodiment, all or a portion of the steps of processing described as being executed automatically can also be performed manually, or all or a portion of the steps of processing described as being performed manually can also be performed automatically using a known method. In addition, the processing procedures, control procedures, specific names, various types of data, and information including parameters that were indicated in the above-described document and in the drawings can be changed as appropriate, unless specifically mentioned otherwise.
A memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disc is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.
The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program defining the steps of processing of the extraction apparatus 10 is implemented as the program module 1093 in which code that is executable by the computer 1000 is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing similar to that of the functional configuration of the extraction apparatus 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may also be replaced by an SSD.
Also, setting data that is to be used in the processing of the above-described embodiment is stored in, for example, the memory 1010 or the hard disk drive 1090 as the program data 1094. Also, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 and executes them as needed.
Note that the program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, and may also be stored in, for example, a removable storage medium and be read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may also be stored in another computer connected via a network (LAN, WAN, etc.). Also, the program module 1093 and the program data 1094 may also be read from another computer by the CPU 1020 via the network interface 1070.
Although an embodiment in which the invention achieved by the inventor is applied was described above, the present invention is not limited by the descriptions and drawings forming a portion of the disclosure of the present invention according to the present embodiment. That is, other embodiments, working examples, operation techniques, and the like achieved based on the present embodiment by a person skilled in the art are all included in the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2018-174531 | Sep 2018 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2019/034864 | 9/4/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2020/059506 | 3/26/2020 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20100100537 | Druzgalski | Apr 2010 | A1 |
20110113047 | Guardalben | May 2011 | A1 |
20110257961 | Tinkler | Oct 2011 | A1 |
20130149681 | Tinkler | Jun 2013 | A1 |
20140122117 | Masarie, Jr. | May 2014 | A1 |
20170013429 | Marti | Jan 2017 | A1 |
20170083547 | Tonkin | Mar 2017 | A1 |
20170185674 | Tonkin | Jun 2017 | A1 |
Number | Date | Country |
---|---|---|
2018-18373 | Feb 2018 | JP |
Number | Date | Country |
---|---|---|
20210342521 A1 | Nov 2021 | US |