This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2022-185041, filed Nov. 18, 2022, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a document retrieving apparatus and a document retrieving method.
A document retrieving apparatus is known that retrieves, from a document, a sentence or a paragraph that is relevant to a query input by a user. In some general documents, semantically relevant information is distributed across a plurality of sentences or paragraphs, or plural pieces of information that are semantically different from each other are included in a single paragraph. Therefore, in some cases, a user needs to check not only a sentence indicated by the document retrieving apparatus but also the sentences before and after that sentence. In other cases, a paragraph indicated by the document retrieving apparatus includes not only the information desired by the user but also information having low relevance to the desired information.
It is important that the document retrieving apparatus can easily retrieve desired information.
According to one embodiment, a document retrieving apparatus includes a memory and processing circuitry. The memory is configured to store block information indicating a plurality of blocks and a plurality of reference features that is associated with the blocks, the blocks each being a group of semantically relevant sentences included in a document. The processing circuitry is configured to extract a retrieval feature to be used in retrieval from a query that is input, retrieve a first block that is relevant to the query from the blocks based on matching of the retrieval feature with the reference features, and generate display information for conducting an emphasis display of the first block.
Hereinafter, embodiments will be described with reference to the accompanying drawings.
The document retrieving apparatus 101 may be implemented in a computer such as a server. The document retrieving apparatus 101 holds one or more documents to be published to users. The document retrieving apparatus 101 retrieves a document that is relevant to a query input by a user, and outputs its retrieval result to the user. An output of the retrieval result includes an emphasis display of a portion that is relevant to the query in a document that is hit in the retrieval.
The user terminal 102 may be a computer that is associated with a user, such as a personal computer or a smartphone. The user terminal 102 includes an input device such as a keyboard, and an output device including a display device. The input device is used to input information such as a query. The display device is used to display information, such as a retrieval result that is acquired by the document retrieving apparatus 101.
As an example, the document retrieving apparatus 101 is implemented as a Web application in a server, and the user terminal 102 accesses the document retrieving apparatus 101 by using a Web browser. If the user terminal 102 accesses the document retrieving apparatus 101, a retrieval screen including an input form for inputting a query is displayed on the Web browser. When a user inputs the query to the input form, the document retrieving apparatus 101 receives, from the user terminal 102, the query input by the user, performs document retrieval by using the received query, and acquires a retrieval result. The document retrieving apparatus 101 adds the retrieval result to the retrieval screen in order to present the retrieval result to the user by using the user terminal 102.
The host device 103 can be a computer, such as a personal computer, that is used by an administrator that administrates the document retrieving apparatus 101. For example, the host device 103 transmits, to the document retrieving apparatus 101, a document to be published to the user.
A system configuration illustrated in
The analyzing unit 201 acquires one or more documents, analyzes the documents, and stores the documents in the first storage unit 202 in association with an analysis result. For example, the analyzing unit 201 receives a document from the host device 103 illustrated in
The first storage unit 202 stores one or more documents and the block information for each of the documents. The block information for each of the documents indicates a plurality of blocks included in this document and reference features that are associated with the respective blocks.
The acquisition unit 211 acquires a query that is input by a user, and transmits the query to the retrieving unit 212. For example, the acquisition unit 211 receives the query from the user terminal 102. The query may be a sentence, or may be a keyword.
The retrieving unit 212 receives the query from the acquisition unit 211, and extracts, from the query, a feature to be used in retrieval. The feature extracted from the query is also referred to as a retrieval feature. Moreover, the retrieving unit 212 retrieves a block that is relevant to the query from the first storage unit 202, by using the retrieval feature. Specifically, the retrieving unit 212 retrieves a block that is relevant to the query from the blocks included in the block information based on matching of the retrieval feature with the reference features stored in the first storage unit 202. The retrieving unit 212 determines a block that is associated with the reference feature that matches the retrieval feature as a block that is relevant to the query. The retrieving unit 212 generates a retrieval result indicating a block that is hit in retrieval (the block that is relevant to the query), and transmits the retrieval result to the display information generating unit 213.
The display information generating unit 213 generates display information for conducting an emphasis display of the block that is hit in retrieval, and outputs the display information. The display information generating unit 213 transmits the display information to the user terminal 102. The user terminal 102 receives the display information from the document retrieving apparatus 101, and displays the display information in a display device. In the display device, the document is displayed in a state where the block that is relevant to the query that is input by the user is emphasized.
Next, an operation of the document retrieving apparatus 101 is described.
In step S301, the analyzing unit 201 receives one or more documents that are input by an administrator. For example, the administrator specifies a document to be published to a user in the host device 103, and the host device 103 transmits the document specified by the administrator to the document retrieving apparatus 101. The analyzing unit 201 receives the document from the host device 103. The document may be a document that has a hierarchical structure, such as a chapter or a paragraph, or may be a document that does not have the hierarchical structure, such as text generated by performing speech recognition. The document may be a document file of any format such as HTML format, PDF format, or Word format. The analyzing unit 201 performs a series of processes illustrated in steps S302 to S304 on each of the documents.
In step S302, the analyzing unit 201 divides the received document into block units, and generates a plurality of blocks. In division into block units (block division), a deep learning model may be used. The analyzing unit 201 uses a deep learning model that has been trained by supervised learning in advance to estimate whether a block boundary is present between two consecutive sentences in the document.
For example, the deep learning model is configured to receive two sentences as an input, and output a score indicating relevance between the two sentences. The analyzing unit 201 inputs two consecutive sentences in the document to the deep learning model, and obtains a score that is output from the deep learning model. In the present embodiment, the score is defined to have a greater value as the two sentences have higher relevance. In a case where the score exceeds a predetermined threshold, the analyzing unit 201 determines that the block boundary is not present between the two sentences; stated another way, the analyzing unit 201 determines that the two sentences belong to the same block. In a case where the score is less than or equal to the predetermined threshold, the analyzing unit 201 determines that the block boundary is present between the two sentences; stated another way, the analyzing unit 201 determines that the two sentences belong to blocks different from each other. In another embodiment, the score may be defined to have a smaller value as the two sentences have higher relevance.
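The threshold rule described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names are hypothetical, and `word_overlap_score` is merely a stand-in for the trained deep learning model, which the description does not specify in detail.

```python
from typing import Callable, List


def word_overlap_score(a: str, b: str) -> float:
    """Hypothetical stand-in for the trained model: Jaccard overlap of words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def divide_into_blocks(
    sentences: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Group consecutive sentences; a boundary is placed where score <= threshold."""
    blocks: List[List[str]] = []
    for sentence in sentences:
        # Compare the new sentence with the last sentence of the current block.
        if blocks and score_fn(blocks[-1][-1], sentence) > threshold:
            blocks[-1].append(sentence)   # no boundary: same block
        else:
            blocks.append([sentence])     # boundary: start a new block
    return blocks
```

Any scoring function with the same signature can be passed as `score_fn`. If the score were defined to decrease with relevance, as in the other embodiment mentioned above, the comparison would be reversed.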
Alternatively, the deep learning model may be configured to convert a sentence into a vector. For example, the deep learning model is configured to receive one sentence as an input, and output a vector that expresses this sentence. The analyzing unit 201 inputs the sentences included in the document to the deep learning model one by one to acquire vectors that correspond to individual sentences. The analyzing unit 201 performs agglomerative clustering on these vectors to put together sentences having high relevance into a block.
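As a rough sketch of the clustering alternative, the following code performs average-linkage agglomerative clustering over sentence vectors using cosine similarity. The linkage criterion, the stopping threshold, and all names are assumptions made for illustration; the description only states that agglomerative clustering is performed. Note that this sketch does not force the sentences of a block to be adjacent in the document.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def agglomerative_blocks(
    vectors: List[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Merge the most similar clusters until no pair exceeds the threshold.

    Returns clusters as sorted lists of sentence indices.
    """
    clusters: List[List[int]] = [[i] for i in range(len(vectors))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average linkage: mean pairwise similarity between two clusters.
        sims = [cosine_similarity(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best: Optional[Tuple[float, int, int]] = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = linkage(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        if best is None or best[0] <= threshold:
            break  # remaining clusters are not similar enough to merge
        _, p, q = best
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)
```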
A division result obtained by using any of the methods described above may be corrected according to a predetermined rule using structure information of the document. This correction processing can prevent occurrence of a situation where a large number of small blocks are generated or a small number of large blocks are generated. For example, it is assumed that an itemized form or a table included in a document includes N items or rows. In a case where N/2 or more blocks are generated as a result of performing block division on the itemized form or the table, the analyzing unit 201 determines one item or row as one block. In contrast, in a case where fewer than N/2 blocks are generated, the analyzing unit 201 determines the entirety of the itemized form or table as one block. In a case where the itemized form includes one item or the table includes one row, the analyzing unit 201 may correct the division result in such a way that the entirety of one item or one row forms one block.
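The N/2 correction rule for an itemized form or a table can be written compactly. The sketch below assumes the items (or rows) and the initial division result are available as Python lists; the function name is hypothetical.

```python
from typing import List


def correct_list_division(items: List[str], blocks: List[List[str]]) -> List[List[str]]:
    """Apply the N/2 rule to the division result of an itemized form or table.

    items:  the N items or rows, in document order.
    blocks: the blocks produced by block division over those items.
    """
    n = len(items)
    if n == 0:
        return []
    if n == 1:
        return [list(items)]               # a single item or row forms one block
    if len(blocks) >= n / 2:
        return [[item] for item in items]  # many blocks: one item or row per block
    return [list(items)]                   # few blocks: the whole list is one block
```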
Referring to
In another example, the reference feature may be a vector that expresses a sentence included in each of the blocks. In this case, the analyzing unit 201 vectorizes a sentence by using a model that has been trained in advance. For example, the model is configured to receive a sentence as an input, and output a vector that expresses this sentence. As the model, a recurrent neural network (RNN) or Bidirectional Encoder Representations from Transformers (BERT) can be used. The analyzing unit 201 acquires a vector for each sentence included in one block.
Note that in a case where a document has a hierarchical structure, the analyzing unit 201 may extract the reference feature from not only the block but also a hierarchy that is higher than a hierarchy including the block. In this case, the analyzing unit 201 extracts the feature from titles of a chapter and a section to which a block of interest belongs in a method that is similar to a method for extracting the feature from a block. In the example illustrated in
In step S304, the analyzing unit 201 stores, in the first storage unit 202, block information indicating a plurality of blocks acquired by dividing the document, and reference features that are associated with the blocks.
If the document analyzing processing illustrated in
Note that the document analyzing processing illustrated in
In step S501 of
In step S502, the retrieving unit 212 extracts a feature to be used in retrieval as a retrieval feature from the query acquired in step S501. This feature extraction is performed according to a method that is similar to a method in which the analyzing unit 201 extracts a feature from each block included in a document.
In step S503, the retrieving unit 212 refers to the block information stored in the first storage unit 202 by using the retrieval feature, and retrieves a block that is relevant to the query. For example, the retrieving unit 212 determines the block that is relevant to the query by comparing the retrieval feature with the reference features included in the block information. In a case where a keyword is used as the feature, the retrieving unit 212 specifies a block that is associated with a reference feature (a reference keyword) that matches the retrieval feature (a retrieval keyword), as the block that is relevant to the query. A first feature matching a second feature includes that a first keyword serving as the first feature matches a second keyword serving as the second feature, and that the first keyword is a quasi-synonym, a synonym, or a homonym of the second keyword. In a case where a vector is used as the feature, a first feature matching a second feature includes, for example, that a degree of similarity (for example, cosine similarity) between a first vector serving as the first feature and a second vector serving as the second feature exceeds a predetermined threshold.
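The two notions of "matching" described above can be illustrated as follows. This is a sketch under stated assumptions: the synonym table `SYNONYMS`, the layout of `block_info` (block identifier mapped to reference keywords), and all function names are hypothetical, and homonym handling is omitted.

```python
import math
from typing import Dict, List, Sequence, Set

# Hypothetical synonym table; the description does not say where synonyms come from.
SYNONYMS: Dict[str, Set[str]] = {"trip": {"travel", "journey"}}


def keywords_match(first: str, second: str, synonyms: Dict[str, Set[str]]) -> bool:
    """True if the keywords are identical or listed as synonyms of each other."""
    a, b = first.lower(), second.lower()
    return a == b or b in synonyms.get(a, set()) or a in synonyms.get(b, set())


def vectors_match(first: Sequence[float], second: Sequence[float], threshold: float) -> bool:
    """True if the cosine similarity of the two vectors exceeds the threshold."""
    dot = sum(x * y for x, y in zip(first, second))
    norm = math.sqrt(sum(x * x for x in first)) * math.sqrt(sum(y * y for y in second))
    return norm > 0 and dot / norm > threshold


def retrieve_blocks(
    retrieval_keywords: List[str],
    block_info: Dict[str, List[str]],
    synonyms: Dict[str, Set[str]],
) -> List[str]:
    """Return identifiers of blocks having a reference keyword that matches the query."""
    return [
        block_id
        for block_id, reference_keywords in block_info.items()
        if any(
            keywords_match(q, r, synonyms)
            for q in retrieval_keywords
            for r in reference_keywords
        )
    ]
```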
In a case where a keyword is used as the feature, the retrieving unit 212 may specify a block that includes all of the keywords that match a plurality of retrieval keywords, as the block that is relevant to the query. Alternatively, the retrieving unit 212 may specify a block that includes one or more keywords that match the retrieval keyword, as the block that is relevant to the query.
In a case where a plurality of blocks is extracted, these blocks may be ranked according to the order of appearance in the document. Alternatively, the blocks may be ranked according to the number or appearance frequency of keywords that match the retrieval keyword. In this case, the analyzing unit 201 counts the appearance frequency of each keyword in the block after extracting the keywords. The block information further includes the appearance frequency of each of the keywords.
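The all-keywords and any-keyword selection policies, together with frequency-based ranking, might look like the sketch below. It assumes that the block information maps each block identifier to a dictionary of lowercase keywords and their appearance frequencies; this layout and the function names are illustrative only.

```python
from typing import Dict, List


def select_blocks(
    retrieval_keywords: List[str],
    block_info: Dict[str, Dict[str, int]],
    require_all: bool,
) -> List[str]:
    """Select blocks containing all (or at least one) of the retrieval keywords."""
    wanted = [k.lower() for k in retrieval_keywords]
    selected = []
    for block_id, frequencies in block_info.items():
        present = [k in frequencies for k in wanted]
        if all(present) if require_all else any(present):
            selected.append(block_id)
    return selected


def rank_by_frequency(
    block_ids: List[str],
    retrieval_keywords: List[str],
    block_info: Dict[str, Dict[str, int]],
) -> List[str]:
    """Rank blocks by the total appearance frequency of the matched keywords."""
    wanted = [k.lower() for k in retrieval_keywords]

    def total(block_id: str) -> int:
        return sum(block_info[block_id].get(k, 0) for k in wanted)

    # sorted() is stable, so ties keep their incoming (document) order.
    return sorted(block_ids, key=total, reverse=True)
```

Because the sort is stable, blocks with equal frequency remain in their order of appearance, which combines the two ranking criteria mentioned above.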
In step S504, the display information generating unit 213 generates display information based on a retrieval result obtained in step S503. The display information generating unit 213 generates display information for conducting an emphasis display of a block that is relevant to a query from a user. Specifically, the display information generating unit 213 adds an instruction of emphasis display based on the retrieval result to the document in order to clearly indicate the block that is relevant to the query from the user. For example, in a case where the document is an HTML file, the display information generating unit 213 inserts an HTML tag into the block that is relevant to the query. As an example, the display information generating unit 213 indicates, in bold, the entirety of the block that is relevant to the query, and changes a font color and a background color of a keyword that matches a retrieval keyword extracted from the query. A method for emphasizing a block and a keyword may be a change in a font or a font size, or a character color or a background color, a change to a bold font, a change to italics, or a combination of two or more of them. Furthermore, instead of changing a font or a background color, the block may be emphasized using a format such as surrounding the entirety of the block with a rectangle. In a case where a plurality of blocks is hit in retrieval, the display information generating unit 213 emphasizes all of these blocks.
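For the HTML case, the emphasis display could be generated along the following lines. The specific tags, colors, and function name are assumptions chosen for illustration; the description only requires that the block and the matching keywords be visually emphasized.

```python
import html
import re
from typing import List

# Hypothetical styling for matched keywords.
KEYWORD_STYLE = "color:#c00;background-color:#ff0"


def emphasize_block(block_text: str, keywords: List[str]) -> str:
    """Return an HTML fragment with the block in bold and keywords highlighted."""
    keywords = [k for k in keywords if k]
    if keywords:
        # Longer keywords first so that they win over their own substrings.
        alternatives = sorted(keywords, key=len, reverse=True)
        pattern = re.compile(
            "(" + "|".join(re.escape(k) for k in alternatives) + ")", re.IGNORECASE
        )
        # With one capture group, re.split alternates plain text and matches.
        parts = pattern.split(block_text)
    else:
        parts = [block_text]
    pieces = []
    for index, part in enumerate(parts):
        escaped = html.escape(part)  # escape after splitting, before adding tags
        if index % 2 == 1:
            pieces.append(f'<span style="{KEYWORD_STYLE}">{escaped}</span>')
        else:
            pieces.append(escaped)
    return "<b>" + "".join(pieces) + "</b>"
```

Splitting the raw text before escaping avoids accidentally matching a keyword inside an HTML entity or inside markup that was inserted for an earlier keyword.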
In step S505, the display information generating unit 213 outputs the generated display information. For example, the display information generating unit 213 transmits the display information to the user terminal 102. In an example where the document retrieving apparatus 101 is implemented in a local computer, the display information generating unit 213 displays the display information in a display device that is included in the computer or is connected to the computer.
Such a display makes the portion that is hit in retrieval easy to view, and makes the blocks before and after that portion easy to check.
Referring to
As described above, in the first embodiment, the document retrieving apparatus 101 retrieves a block that is relevant to a query from a user in a document, and conducts an emphasis display of a block that is hit in retrieval. This enables a group of semantically relevant pieces of information to be displayed in a form that is easy for the user to view. For example, even in a document in which one item (for example, a paragraph) includes two pieces of information that are semantically different from each other, or in which semantically relevant pieces of information are distributed across a plurality of items (for example, sentences or paragraphs), information can likewise be displayed so as to be easily viewed. As a result, the user can efficiently discover desired information.
As illustrated in
The relationship extracting unit 701 receives block information from the analyzing unit 201, estimates relevance between blocks, and stores, in the second storage unit 702, relevance information indicating semantic relevance between the blocks.
The display information generating unit 213 generates display information based on a retrieval result received from the retrieving unit 212 and the relevance information stored in the second storage unit 702.
Next, an operation of the document retrieving apparatus 700 is described.
In step S801 of
The degree of similarity between blocks may be estimated by using a deep learning model that has been trained in advance. The relationship extracting unit 701 converts each of the blocks into one feature vector by using the model. The relationship extracting unit 701 calculates cosine similarity between two feature vectors that correspond to two blocks. The relationship extracting unit 701 determines that the two blocks are relevant to each other (the two blocks have high relevance) in a case where the calculated cosine similarity exceeds a predetermined threshold, and integrates these blocks. The relationship extracting unit 701 determines that the two blocks are not relevant to each other (the two blocks have low relevance) in a case where the calculated cosine similarity is equal to or less than the predetermined threshold, and does not integrate these blocks.
Alternatively, the relationship extracting unit 701 may obtain feature vectors from individual blocks and may perform clustering on the feature vectors to generate relevance information indicating semantic relevance between the blocks.
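The threshold-based estimation of relevance between blocks can be sketched as below. The sketch assumes each block has already been converted into one feature vector; the representation of the relevance information as a mapping from a block identifier to the set of relevant block identifiers is a hypothetical choice.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Set


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def extract_relevance(
    block_vectors: Dict[str, Sequence[float]], threshold: float
) -> Dict[str, Set[str]]:
    """Map each block to the set of blocks whose similarity exceeds the threshold."""
    relevance: Dict[str, Set[str]] = {block_id: set() for block_id in block_vectors}
    for a, b in combinations(block_vectors, 2):
        if cosine_similarity(block_vectors[a], block_vectors[b]) > threshold:
            # Relevance is symmetric, so record it in both directions.
            relevance[a].add(b)
            relevance[b].add(a)
    return relevance
```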
As illustrated in
Block integration may be performed in consideration of the number of characters included in a block. For example, blocks having a large number of characters are integrated with each other only in a case where a degree of similarity is sufficiently high, and blocks having a small number of characters are integrated with each other even in a case where a degree of similarity is not so high. By doing this, the number of characters included in a block is indirectly controlled, and generation of an excessively long block or generation of a large number of short blocks can be prevented.
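One way to take the number of characters into account is to make the similarity threshold grow with the combined length of the two blocks. The linear interpolation and every constant below are assumptions chosen only to illustrate the idea; the description does not specify how the threshold depends on length.

```python
def integration_threshold(
    chars_a: int,
    chars_b: int,
    low: float = 0.6,       # hypothetical threshold for very short blocks
    high: float = 0.9,      # hypothetical threshold for very long blocks
    max_chars: int = 1000,  # hypothetical combined length at which `high` applies
) -> float:
    """Return a similarity threshold that increases with the combined block length."""
    ratio = min(1.0, (chars_a + chars_b) / max_chars)
    return low + (high - low) * ratio


def should_integrate(similarity: float, chars_a: int, chars_b: int) -> bool:
    """Integrate two blocks only if their similarity exceeds the length-aware threshold."""
    return similarity > integration_threshold(chars_a, chars_b)
```

With these example constants, a similarity of 0.75 is sufficient to integrate two 50-character blocks but not two 500-character blocks.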
In a case where there is a plurality of documents to be retrieved, targets to be integrated may be limited to targets in an identical document. In general, in many cases, different documents have different contents, and therefore there is a low probability that a block of a certain document and a block of another document have high relevance. By limiting targets to be integrated to targets in an identical document, calculation resources, such as the time spent for calculating a degree of similarity, can be reduced.
Referring to
Note that the document analyzing processing illustrated in
In step S1001, the display information generating unit 213 generates display information based on a retrieval result received from the retrieving unit 212 and the relevance information stored in the second storage unit 702. The display information generating unit 213 refers to the relevance information, specifies a block having high relevance to a block indicated by the retrieval result (that is, a block that is relevant to a query that is input by a user), and generates display information for conducting an emphasis display of the block indicated by the retrieval result and the specified block.
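The use of the relevance information in step S1001 can be illustrated with a short sketch. It assumes the relevance information is a mapping from a block identifier to the set of identifiers of highly relevant blocks; that layout and the function name are hypothetical.

```python
from typing import Dict, List, Set


def blocks_to_emphasize(hit_blocks: List[str], relevance: Dict[str, Set[str]]) -> List[str]:
    """Return the hit blocks followed by the blocks that are highly relevant to them."""
    result = list(hit_blocks)
    for block_id in hit_blocks:
        # sorted() only makes the output order deterministic.
        for related in sorted(relevance.get(block_id, set())):
            if related not in result:
                result.append(related)
    return result
```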
In the example illustrated in
Alternatively, a display may be conducted in the form of omitting interposed blocks. By doing this, a user can view collected pieces of information in a more concentrated manner. In the case of omission, a button for switching between the display of the entire text and the display of only the corresponding blocks may be displayed so that the user can view the surrounding information.
In the first embodiment, a plurality of blocks may be hit in retrieval. In a case where a plurality of blocks is hit in retrieval, it is effective to narrow a document retrieval result in order to acquire desired information. In a second embodiment, a user narrows a document retrieval result by using interaction with a chatbot.
As illustrated in
The selecting unit 1201 selects a block to be used to generate display information and generate a response based on a retrieval result received from the retrieving unit 212 and relevance information stored in the second storage unit 702. In a case where a plurality of blocks is hit in retrieval, the selecting unit 1201 selects one block from these blocks, and transmits selection information indicating the selected block to the display information generating unit 213. For example, the retrieving unit 212 ranks the blocks that are hit in retrieval, and the selecting unit 1201 selects a block that ranks first. In a case where a plurality of blocks is hit in retrieval, the selecting unit 1201 determines a candidate for an additional query for narrowing a document retrieval result based on reference features associated with these blocks. For example, the selecting unit 1201 selects one or more reference features as an additional query candidate from the reference features associated with the blocks.
The display information generating unit 213 generates display information for displaying a list of the blocks that are hit in retrieval and displaying a document that includes the block selected by the selecting unit 1201.
The response generating unit 1202 generates and outputs a response that proposes the additional query candidate determined by the selecting unit 1201 and prompts a user to input an additional query.
Next, an operation of the document retrieving apparatus 1200 is described.
Document analyzing processing according to the second embodiment is the same as the document analyzing processing according to the variation of the first embodiment, and therefore the description of the document analyzing processing is omitted.
As illustrated in
In step S1001, the display information generating unit 213 generates display information for displaying a list of a plurality of blocks that is hit in retrieval and displaying a document that includes the block selected by the selecting unit 1201.
In step S1302, the response generating unit 1202 generates a response based on the block selected by the selecting unit 1201 and the retrieval result acquired by the retrieving unit 212. For example, in a case where a plurality of blocks is hit in retrieval, the response generating unit 1202 generates a response based on the number of blocks that are hit in retrieval and a keyword that is different from a query that is input by a user. The keyword that is different from the query that is input by the user may be selected from reference features (keywords) that are associated with the blocks that are hit in retrieval.
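A response of the kind described in step S1302 could be assembled as follows. The wording of the response, the choice of the first unmentioned keyword as the candidate, and the data layout (block identifier mapped to reference keywords) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def generate_response(
    hit_blocks: Dict[str, List[str]], query_keywords: List[str]
) -> Optional[str]:
    """Build a response that reports the number of hits and proposes a keyword.

    Returns None when at most one block is hit, since no narrowing is needed.
    """
    if len(hit_blocks) <= 1:
        return None
    mentioned = {k.lower() for k in query_keywords}
    candidates: List[str] = []
    for reference_keywords in hit_blocks.values():
        for keyword in reference_keywords:
            # Propose only keywords that the user has not mentioned.
            if keyword.lower() not in mentioned and keyword not in candidates:
                candidates.append(keyword)
    count = len(hit_blocks)
    if not candidates:
        return f"{count} blocks were found."
    return (
        f"{count} blocks were found. "
        f'Would you like to narrow the results, for example with "{candidates[0]}"?'
    )
```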
In step S505, the display information generating unit 213 outputs display information, and in step S1303, the response generating unit 1202 outputs a response.
When the user inputs an additional query, the retrieving unit 212 performs retrieval by using the query that is input first and the additional query.
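Narrowing with an additional query can be sketched as a retrieval over the union of the two keyword sets. This sketch assumes that the first query and the additional query are combined conjunctively, so that a block must contain every keyword; the description does not state how the two queries are combined, and the names below are hypothetical.

```python
from typing import Dict, List


def retrieve(keywords: List[str], block_info: Dict[str, List[str]]) -> List[str]:
    """Return blocks whose reference keywords include every given keyword."""
    wanted = {k.lower() for k in keywords}
    return [
        block_id
        for block_id, reference_keywords in block_info.items()
        if wanted <= {r.lower() for r in reference_keywords}
    ]


def retrieve_with_additional(
    first_query: List[str],
    additional_query: List[str],
    block_info: Dict[str, List[str]],
) -> List[str]:
    """Retrieve by using the first query together with the additional query."""
    return retrieve(list(first_query) + list(additional_query), block_info)
```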
In a case where a retrieval result indicates that a plurality of blocks is hit in retrieval, the response generating unit 1202 generates a response that includes a keyword that is included in any of the blocks that are hit in retrieval, and also includes a keyword that is not mentioned by the user.
As illustrated in
At the point in time when the interaction of the first turn has terminated, 20 blocks are hit in retrieval. A block that ranks first and relates to domestic business trips is selected, and a document including the selected block is displayed in the region 1603. As illustrated in
In a case where many pieces of information fail to be displayed in a display device like a smartphone, a document may be displayed in balloons indicating interaction between the user U and the chatbot S, as illustrated in
In a case where a plurality of blocks is hit in retrieval, if all of the blocks are displayed in the form of a balloon, too many balloons are displayed, and it is difficult to view. By only displaying a block selected by the selecting unit 1201 in the form of a balloon, a display that is easy to view can be conducted.
As described above, in the second embodiment, in a case where a plurality of blocks is hit in retrieval, the document retrieving apparatus 1200 sets candidates for an additional query for narrowing a document retrieval result based on reference features that are associated with these blocks, and outputs a response that prompts the input of the additional query while proposing the candidates for the additional query. By doing this, a user can, with guidance from the apparatus, retrieve even information with which the user is not familiar.
Each of the document retrieving apparatuses 101, 700, and 1200 can be implemented in a computer. A hardware configuration of a computer that can implement a document retrieving apparatus according to an embodiment is described.
The processor 1901 includes a general-purpose processor such as a central processing unit (CPU). The RAM 1902 includes a volatile memory such as a synchronous dynamic random access memory (SDRAM), and is used as a working area of the processor 1901. The auxiliary storage device 1903 includes a non-volatile memory such as a hard disk drive (HDD) or a solid state drive (SSD), and stores programs including a document retrieval program, data, and the like.
The processor 1901 operates according to a program stored in the auxiliary storage device 1903. When the document retrieval program is executed by the processor 1901, the document retrieval program causes the processor 1901 to perform processing described with respect to the document retrieving apparatuses 101, 700, and 1200. For example, the processor 1901 functions as the analyzing unit 201, the acquisition unit 211, the retrieving unit 212, the display information generating unit 213, the selecting unit 1201, and the response generating unit 1202 that are included in the document retrieving apparatus 1200 in accordance with the document retrieval program. The auxiliary storage device 1903 functions as the first storage unit 202 and the second storage unit 702 that are included in the document retrieving apparatus 1200.
The communication interface 1904 is an interface for performing communication with an external apparatus. The processor 1901 performs communication with the user terminal 102 and the host device 103 via the communication interface 1904.
Note that the processor 1901 may include a dedicated processor such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC) instead of or in addition to the general-purpose processor. The processor 1901 refers to the general-purpose processor, the dedicated processor, or a combination of the general-purpose processor and the dedicated processor. The processor 1901 is also referred to as processing circuitry.
A program such as the document retrieval program may be provided to the computer 1900 in a state stored in a computer-readable recording medium. In this case, the computer 1900 includes a drive that reads data from the recording medium, and acquires the program from the recording medium. Examples of the recording medium include a magnetic disk, an optical disk (a CD-ROM, a CD-R, a DVD-ROM, a DVD-R, or the like), a magneto-optical disk (an MO or the like), and a semiconductor memory. Furthermore, the program may be distributed through a communication network. Specifically, the program may be stored in a server on the communication network, and the computer 1900 may download the program from the server.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2022-185041 | Nov 2022 | JP | national |