This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2015-014399, filed on Jan. 28, 2015, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a character-data conversion program, or the like.
Structured documents, such as HTML or XML, include tags and document contents (original texts) in a mixed manner, and they are represented by using a text format. As for tags that are used in structured documents, some of the tags include a variable section, such as a reference source, and the others do not include any variable sections. Here, with regard to the tags that have a variable section, there are few tags that are the same in their entirety, and there are limited types of tags that do not have any variable sections. Examples of the tags that do not include any variable sections are <title></title> and <body></body>.
For compression on the above-described documents that include tags and original texts in a mixed manner, the LZ77-based compression of ZIP, or the like, is known for assigning codes by using longest-match string searching.
Furthermore, as another example, there is a known technology for compressing documents that include a tag that does not have any variable sections (for example, see Japanese Laid-open Patent Publication No. 2000-101442). According to the technology, for example, a data compression device identifies a tag in a character string stream, removes it, and outputs it as tag information. Then, the data compression device allocates a tag code in the position of the character string stream, from which the tag is removed, so as to identify it, encodes the character string stream that includes the allocated tag code, and outputs the code stream. Furthermore, the information on the removed tag is used to search the position of the corresponding tag code in the character string stream.
However, conventional technologies have a first problem in that, when compression is conducted on a document that includes a tag and an original text in a mixed manner, the compression rate of the original text is degraded. Furthermore, from a different perspective, there is a second problem in that, when compression is conducted on a document that includes a tag and an original text in a mixed manner, the positional relationship of the tag and the character string is not maintained.
The first problem is explained. For example, in the case of ZIP, the document compression device allocates an original text and a tag in a slide window and then conducts longest-match string searching; therefore, the optimum character string is dropped out of the slide window. Specifically, the size of the slide window is previously set and, if the data to be stored within the slide window exceeds the size of the slide window, the previously stored data in the slide window is dropped out. Therefore, during the LZ77-based compression on documents that include tags and original texts in a mixed manner, the area of the original text for which longest match is made is small. That is, the LZ77-based compression on documents that include tags and original texts in a mixed manner has a problem in that the compression rate of original texts is degraded.
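The effect of the slide window can be illustrated with a minimal sketch, which is not the ZIP implementation. The function name longest_match and the window sizes are chosen only for this illustration. When the window is large enough, the earlier occurrence of "fever" is found; when the intervening tag fills a small window, only a one-character match remains.

```python
from typing import Tuple


def longest_match(data: str, pos: int, window_size: int) -> Tuple[int, int]:
    """Return (distance, length) of the longest match for data[pos:] that
    starts within the `window_size` characters preceding `pos`."""
    start = max(0, pos - window_size)
    best_dist, best_len = 0, 0
    for cand in range(start, pos):
        length = 0
        # Extend the match while characters agree and input remains.
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len


text = 'fever<side_effect type="bf03">fever'
pos = text.rindex("fever")

wide = longest_match(text, pos, window_size=64)    # first "fever" still in window
narrow = longest_match(text, pos, window_size=16)  # first "fever" has dropped out
```

With the 64-character window the whole word is matched against its earlier occurrence, whereas with the 16-character window the tag has pushed that occurrence out and only a single "f" inside the tag can be matched.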
Furthermore, the second problem is explained. Conventional data compression devices allocate a tag code, which is obtained by encoding a tag, in the position of a character string stream and then compress the character string stream that includes the tag code and the original text; therefore, the positional relationship of the tag and the character string is not maintained.
According to an aspect of an embodiment, a non-transitory computer-readable recording medium stores an encoding program for causing a computer to execute a process. The process includes identifying a plurality of tag sections and a plurality of original text sections in input character data that includes a tag having a variable section. The process includes converting each of a plurality of tags included in the plurality of tag sections into a plurality of first-type codes that respectively correspond to tag contents of the plurality of tags. The process includes converting original texts in the plurality of original text sections into a plurality of second-type codes, each of the plurality of second-type codes being separated at least in boundaries between the plurality of tag sections and the plurality of original text sections in the input character data. The process includes outputting encoded data including the plurality of first-type codes and the plurality of second-type codes, positional relationships of the plurality of tags and the original texts in the input character data being maintained with corresponding plurality of first-type codes and corresponding plurality of second-type codes in the encoded data.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
Preferred embodiments of the present invention will be explained with reference to accompanying drawings. Here, the present invention is not limited to the embodiment.
First, with reference to
Here, tags refer to character strings that start with a start symbol ‘<’ and that end with an end symbol ‘>’, and they include tags that do not include any variable sections and tags that include a variable section. Examples of the tag that does not include any variable sections include <title></title> and <body></body>. Examples of the tag that includes a variable section include tags whose anchor name differs and tags whose attributes are designated in a different order. For example, the file F1 contains the data “<medical_effect type=“ac01”> . . . <side_effect type=“bf03”>fever</side_effect> . . . ”. In this data, <medical_effect type=“ac01”> and <side_effect type=“bf03”> are the tags that include a variable section. In this data, “fever” corresponds to the character string of the original text other than the tag.
The information processing apparatus loads, into a memory area, the character data that is stored in the compression target file F1. The information processing apparatus extracts a character string from the beginning of the memory area and determines whether the extracted character string is a tag. For example, the information processing apparatus determines whether the starting character of the character string is the start symbol ‘<’ of the tag.
If the character string is a tag, the information processing apparatus collectively registers the entire tag in a dynamic tag dictionary T0 and, based on the dynamic tag dictionary T0, compresses it into the compressed code that corresponds to the registered tag. The compressed code, for example, corresponds to the first-type code.
Here, the dynamic tag dictionary T0 is a dictionary that relates a tag to a dynamic code that is dynamically assigned. Specifically, the information processing apparatus registers the character string of the entire tag and assigns, as a compressed code, a dynamic code that is dynamically assigned in the order it is registered. Furthermore, an example of the data structure of the dynamic tag dictionary T0 is described later.
If the character string is not a tag, the information processing apparatus outputs the character string as the original text to a bit filter B0. The information processing apparatus compares the bit filter B0 with the output character string and determines whether the output character string hits the bit filter B0. If the character string hits the bit filter B0, the information processing apparatus converts it into the compressed code that corresponds to the character string of the word based on a static dictionary. The compressed code, for example, corresponds to the second-type code. Furthermore, in the embodiment, it is assumed that the character string hits the bit filter B0.
Here, the bit filter B0 is a filter that identifies the character string of the word to be compressed by using the static dictionary. The static dictionary refers to a dictionary that relates a word to a compressed code in accordance with the frequency of the word on a document-to-document basis. Examples of the document include the compression target file. The static dictionary has previously registered therein the static code that is the compressed code that corresponds to each word. Furthermore, an example of the data structure of the bit filter B0 is described later.
The information processing apparatus outputs, to a compression file F2, each compressed code based on the dynamic tag dictionary T0 and each compressed code based on the bit filter B0 in a state such that the pre-conversion positional relationship of a tag or an original text in the input character data with regard to each compressed code is maintained.
An explanation is given of an operation in a case where compression is conducted on the character string “<side_effect type=“bf03”> . . . ” in the file F1 that is the target to be compressed by the information processing apparatus.
First, the information processing apparatus determines whether the starting character of the character string is the start symbol ‘<’ of the tag. In the example of
Furthermore, the information processing apparatus assigns, as a compressed code, the dynamic code d1 that is in the dynamic tag dictionary T0 and that is related to the character string of the tag. Here, “F80001h” is assigned as a compressed code of the character string “<side_effect type=“bf03”>” of the tag. Then, the information processing apparatus outputs the compressed code to the compression file F2 in a state such that the pre-conversion positional relationship of the tag in the input character data with regard to the compressed code is maintained.
Next, an explanation is given of an operation in a case where the information processing apparatus compresses the character string “fever” in the compression target file F1.
First, the information processing apparatus determines whether the starting character “f” of the character string is the start symbol ‘<’ of the tag. In the example of
Example of the Dynamic Tag Dictionary
An explanation is given of, for example, a case where a compressed code is assigned to the character string “<side_effect type=“bf03”>” of the tag.
The information processing apparatus collectively stores the character string “<side_effect type=“bf03”>” of the tag in the tag buffer T1. The information processing apparatus registers, in the address table T2, the storage location in which the character string of the tag is stored and the length of the stored data. Here, the information processing apparatus registers “28” as the storage location and “25” as the data length in the address table T2.
The information processing apparatus assigns, as a compressed code, the dynamic code that is in the address table T2 and that is related to the character string of the tag. Here, the information processing apparatus assigns, as a compressed code, the dynamic code “F80001h” that is related to the character string “<side_effect type=“bf03”>” of the tag.
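The registration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: the class name DynamicTagDictionary, the helper index, and the starting code F80000h are hypothetical, the latter being inferred only from the example in which the tag stored at location 28 receives F80001h. The tags are written with straight quotation marks.

```python
from typing import Dict, List, Tuple

BASE_CODE = 0xF80000  # assumption: code given to the first registered tag


class DynamicTagDictionary:
    """Tag buffer T1 plus address table T2, with codes assigned in
    registration order."""

    def __init__(self) -> None:
        self.tag_buffer = ""                            # T1: tags stored back to back
        self.address_table: List[Tuple[int, int]] = []  # T2: (storage location, data length)
        self._rows: Dict[str, int] = {}                 # hypothetical helper: tag -> row in T2

    def encode(self, tag: str) -> int:
        row = self._rows.get(tag)
        if row is None:
            # New tag: store the whole tag and record where it was stored.
            row = len(self.address_table)
            self.address_table.append((len(self.tag_buffer), len(tag)))
            self.tag_buffer += tag
            self._rows[tag] = row
        return BASE_CODE + row


t0 = DynamicTagDictionary()
first = t0.encode('<medical_effect type="ac01">')
second = t0.encode('<side_effect type="bf03">')
again = t0.encode('<side_effect type="bf03">')  # already registered, same code
```

Under these assumptions the second tag is stored at location 28 with length 25 and receives F80001h, which agrees with the values given in the example.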
Example of the Bit Filter
The 2-gram is the information that indicates the character code string of 2 characters. The bitmap indicates the bitmap that corresponds to the character code string of the 2-gram. For example, the bitmap that corresponds to “00h00h” is “0_0_0_0_0”. The pointer is a pointer that indicates the location of the word character string that corresponds to a bitmap.
The word character string is a Japanese word that is registered in the static dictionary, and it is represented by using a character code string. Here, a character code string is noted in parentheses. The character-code string length is the length of the character code string that corresponds to a word character string. The static code is a compressed code that is assigned to a word character string.
An explanation is given of, for example, a case where a compressed code is assigned to the word character string “fever”. The information processing apparatus compares the bit filter B0 with the word character string “fever”. Because the word character string “fever” hits the bit filter B0, the information processing apparatus identifies, as a compressed code, the static code “C00010” that is registered in the static dictionary.
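One possible reading of the 2-gram bit filter is sketched below. The meaning of the bitmap is an assumption made for this illustration only: bit i of a 2-gram's bitmap is set when that 2-gram starts at offset i of some registered word. The function names and the entry "cough" are hypothetical; only "fever" and its static code C00010h come from the example.

```python
from typing import Dict, Iterable, Optional

# "fever" -> C00010h is taken from the example; "cough" is a hypothetical entry.
STATIC_DICTIONARY: Dict[str, int] = {"fever": 0xC00010, "cough": 0xC00011}


def build_bit_filter(words: Iterable[str]) -> Dict[str, int]:
    """Map each 2-gram to a bitmap; bit i is set when the 2-gram starts at
    offset i of some registered word (assumed bitmap semantics)."""
    bit_filter: Dict[str, int] = {}
    for word in words:
        for i in range(len(word) - 1):
            gram = word[i:i + 2]
            bit_filter[gram] = bit_filter.get(gram, 0) | (1 << i)
    return bit_filter


def hits(bit_filter: Dict[str, int], text: str) -> bool:
    """True when every 2-gram of `text` is allowed at its offset."""
    if len(text) < 2:
        return False  # assumption: the sketch ignores one-character words
    return all(
        (bit_filter.get(text[i:i + 2], 0) >> i) & 1
        for i in range(len(text) - 1)
    )


def encode_word(bit_filter: Dict[str, int], text: str) -> Optional[int]:
    """Return the static code for `text`, or None when it is not registered."""
    if not hits(bit_filter, text):
        return None
    # A filter hit is only a candidate; the dictionary lookup confirms it.
    return STATIC_DICTIONARY.get(text)


B0 = build_bit_filter(STATIC_DICTIONARY)
code = encode_word(B0, "fever")
```

The filter rejects most unregistered strings cheaply, and the dictionary lookup is performed only for strings that pass it.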
Example of the Configuration of the Compression File
Flow of a Decompression Operation
The information processing apparatus reads the compressed data and determines whether the compressed data is a dynamic code. In the example of
Furthermore, the information processing apparatus reads the compressed data and determines whether the compressed data is a dynamic code. In the example of
Furthermore, the information processing apparatus writes the identified decompressed data in the memory area A3. Furthermore, after the entire compressed data in the compression file F2 is decompressed, the information processing apparatus writes the decompressed data, which has been written in the memory area A3, into the decompression file F3.
Configuration of the Information Processing Apparatus
The compression unit 100a is a processing unit that performs the compression operation that is illustrated in
For example, the data structure of a leaf is represented by 61. For example, the leaf stores the leaf identification information, the compressed code length, and the pointer to the word. The leaf identification information is the information for uniquely identifying the leaf. The compressed code length is the information that indicates a valid length among the bit sequence of the compressed data that is compared with each of the branches 60-1 to 60-n. The pointer to a word is the information for uniquely identifying the decompressed data when the compressed code is decompressed, and it corresponds to the pointer to the decompressed data.
Configuration of the Compression Unit
The file read unit 101 reads the character string of the content portion in the file F1. The file read unit 101 outputs the read character string to the tag determining unit 102.
The tag determining unit 102 determines whether the character string is a tag. For example, the tag determining unit 102 determines whether the starting character of the character string is the start symbol ‘<’ of the tag. If the starting character of the character string is the start symbol ‘<’ of the tag, the tag determining unit 102 outputs the tag character string to the tag encode unit 103. The tag character string is the character string that starts with the start symbol ‘<’ and ends with the end symbol ‘>’. Furthermore, if the starting character of the character string is not the start symbol ‘<’ of the tag, the tag determining unit 102 outputs the character string to the text encode unit 104.
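The determination made by the tag determining unit 102 can be sketched as a scan that separates tag character strings from the remaining character strings. The function name split_sections and the handling of an unterminated tag are assumptions for this illustration.

```python
from typing import Iterator, Tuple


def split_sections(data: str) -> Iterator[Tuple[str, str]]:
    """Yield ("tag", s) for strings from '<' through '>' and ("text", s)
    for everything between tags, in their original order."""
    pos = 0
    while pos < len(data):
        if data[pos] == "<":
            end = data.find(">", pos)
            if end == -1:
                # Assumption: the sketch rejects a tag with no end symbol.
                raise ValueError("tag without end symbol '>'")
            yield ("tag", data[pos:end + 1])
            pos = end + 1
        else:
            end = data.find("<", pos)
            if end == -1:
                end = len(data)
            yield ("text", data[pos:end])
            pos = end


sections = list(split_sections('<side_effect type="bf03">fever</side_effect>'))
```

Each "tag" item would be passed to the tag encode unit 103 and each "text" item to the text encode unit 104.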
The tag encode unit 103 encodes the tag character string. For example, the tag encode unit 103 determines whether the tag character string is already stored in the tag buffer T1. If the tag character string is already stored in the tag buffer T1, the tag encode unit 103 assigns, as compressed data, the dynamic code that is in the address table T2 and that is related to the tag character string. If the tag character string is not stored in the tag buffer T1, the tag encode unit 103 collectively stores the tag character string in the tag buffer T1 and registers, in the address table T2, the storage location in which the tag character string is stored and the length of the stored data. The tag encode unit 103 assigns, as compressed data, the dynamic code that is in the address table T2 and that is related to the tag character string. Then, the tag encode unit 103 outputs the assigned compressed data to the update unit 105.
The text encode unit 104 encodes a character string. For example, the text encode unit 104 outputs a character string as an original text to the bit filter B0. The text encode unit 104 compares the bit filter B0 with the output character string and determines whether the output character string hits the bit filter B0. If the output character string hits the bit filter B0, the text encode unit 104 identifies, as compressed data, the static code that is registered in the static dictionary. Then, the text encode unit 104 outputs the identified compressed data to the update unit 105.
The update unit 105 acquires the compressed data from the tag encode unit 103 and the text encode unit 104 and stores the acquired compressed data in the memory area in the order they are acquired, whereby the memory area is updated.
After the entire character string of the content portion in the file F1 is compressed, the file write unit 106 writes the compressed data, which has been written in the memory area, into the compression file F2.
Configuration of the Decompression Unit
The file read unit 110 reads compressed data from the compression file F2 into the memory area A1. After the decompression operation is completed for the compressed data that is stored in the memory area, the file read unit 110 reads new compressed data from the compression file F2 and stores it in the memory area A1.
The tag-code determining unit 111 determines whether the compressed data is the code of a tag. For example, the tag-code determining unit 111 determines whether the compressed data is a dynamic code. For example, in a case where the dynamic code is a fixed-length 3-byte code that starts with the hex number “F”, the tag-code determining unit 111 determines whether the first four bits of the compressed data are “F”. If they are “F”, the tag-code determining unit 111 determines that the compressed data is a dynamic code, that is, the code of a tag, and outputs the compressed data to the tag decompression unit 112. If they are not “F”, the tag-code determining unit 111 determines that the compressed data is not a dynamic code, that is, not the code of a tag, and outputs the compressed data to the text decompression unit 113.
The tag decompression unit 112 decompresses the compressed data by using the dynamic tag dictionary T0. For example, the tag decompression unit 112 identifies the dynamic code that matches the compressed data in the address table T2 of the dynamic tag dictionary T0 and acquires the storage location and the data length that correspond to the identified dynamic code. The tag decompression unit 112 identifies, in the tag buffer T1 of the dynamic tag dictionary T0, the decompressed data with the storage location and the data length that are acquired. The tag decompression unit 112 outputs the identified decompressed data to the update unit 114.
The text decompression unit 113 decompresses compressed data by using the decompression nodeless tree. For example, the text decompression unit 113 compares the compressed data with the decompression nodeless tree and identifies the pointer to the decompressed data, indicated by the decompression nodeless tree. The text decompression unit 113 identifies the decompressed data on the basis of the identified pointer to the decompressed data. The text decompression unit 113 outputs the identified decompressed data to the update unit 114.
The update unit 114 acquires the decompressed data from the tag decompression unit 112 and the text decompression unit 113 and stores the acquired decompressed data in the memory area A3 in the order they are acquired, thereby updating the memory area.
After the entire compressed data in the compression file F2 is decompressed, the file write unit 115 writes the decompressed data, which has been written in the memory area, into the decompression file F3.
Steps of an Operation of the Compression Unit
Next, an explanation is given, with reference to
As illustrated in
The compression unit 100a extracts a character string from the beginning of the memory area and determines whether the character string is a tag section (Step S103). For example, the compression unit 100a determines whether the beginning of the character string is the start symbol ‘<’ of the tag character string.
If it is determined that the character string is a tag section (Step S103; Yes), the compression unit 100a determines whether the tag section is stored in the tag buffer T1 (Step S104). If the tag section is stored in the tag buffer T1 (Step S104; Yes), the compression unit 100a proceeds to Step S106 to assign the dynamic code of the tag section.
Conversely, if the tag section is not stored in the tag buffer T1 (Step S104; No), the compression unit 100a stores the tag section in the tag buffer T1 and stores the storage location and the length of the tag section in the address table T2 (Step S105). Then, the compression unit 100a proceeds to Step S106 to assign the dynamic code of the tag section.
At Step S106, the compression unit 100a assigns, as compressed data, the dynamic code that is in the address table T2 and that corresponds to the tag section (Step S106). Specifically, the compression unit 100a extracts, from the address table T2, the dynamic code that is included in the record that stores the storage location and the length of the tag section, and it assigns the extracted dynamic code as compressed data. Then, the compression unit 100a proceeds to Step S108.
If it is determined that the character string is not a tag section (Step S103; No), the compression unit 100a assigns, as compressed data, the static code that is registered in the static dictionary (Step S107). Specifically, the compression unit 100a compares the character string with the bit filter B0 and, because the character string hits the bit filter B0, identifies, as compressed data, the static code that is registered in the static dictionary. Then, the compression unit 100a proceeds to Step S108.
At Step S108, the compression unit 100a writes the compressed data in a write memory area (Step S108).
The compression unit 100a determines whether there is a character string to be processed in the read memory area (Step S109). If it is determined that there is a character string to be processed in the read memory area (Step S109; Yes), the compression unit 100a proceeds to Step S103 to process the subsequent character string.
Conversely, if it is determined that there is no character string to be processed in the read memory area (Step S109; No), the compression unit 100a terminates the compression operation.
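Steps S103 to S109 can be summarized in a short sketch. It is a simplification under stated assumptions: a regular expression stands in for the tag determination, a plain mapping stands in for the bit filter B0 and the static dictionary, and the starting dynamic code F80000h is assumed. The point illustrated is that the codes are written in the same order as the tags and original texts appear in the input.

```python
import re
from typing import Dict, List

BASE_CODE = 0xF80000                      # assumption: first dynamic code
STATIC_DICTIONARY = {"fever": 0xC00010}   # static code taken from the example


def compress(data: str) -> List[int]:
    tag_codes: Dict[str, int] = {}   # stands in for tag buffer T1 and address table T2
    out: List[int] = []              # write memory area
    for section in re.findall(r"<[^>]*>|[^<]+", data):
        if section.startswith("<"):                              # Step S103; Yes
            if section not in tag_codes:                         # Step S104; No
                tag_codes[section] = BASE_CODE + len(tag_codes)  # Step S105
            out.append(tag_codes[section])                       # Steps S106, S108
        else:                                                    # Step S103; No
            # As in the embodiment, the text is assumed to be registered;
            # an unregistered string raises KeyError in this sketch.
            out.append(STATIC_DICTIONARY[section])               # Steps S107, S108
    return out


codes = compress(
    '<side_effect type="bf03">fever</side_effect>'
    '<side_effect type="bf03">fever'
)
```

A tag that appears a second time reuses the dynamic code assigned at its first appearance, and each code occupies the position of the tag or original text from which it was converted.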
Steps of an Operation of the Decompression Unit
Next, an explanation is given, with reference to
As illustrated in
The decompression unit 100b reads the compressed data from the read memory area into the memory area A1 and determines whether the compressed data is a dynamic code (Step S204). For example, if the dynamic code is a fixed-length 3-byte code that starts with the hex number “F”, the decompression unit 100b determines whether the first four bits of the compressed data are “F”.
If it is determined that the compressed data is a dynamic code (Step S204; Yes), the decompression unit 100b identifies the decompressed data on the basis of the dynamic code in the dynamic tag dictionary T0 (Step S205). For example, the decompression unit 100b identifies, in the address table T2 of the dynamic tag dictionary T0, the dynamic code that matches the compressed data and acquires the storage location and the data length that correspond to the identified dynamic code. The tag decompression unit 112 identifies the decompressed data with the acquired data length from the acquired storage location with respect to the tag buffer T1 of the dynamic tag dictionary T0. Then, the decompression unit 100b proceeds to Step S208.
Conversely, if it is determined that the compressed data is not a dynamic code (Step S204; No), the decompression unit 100b compares the decompression nodeless tree with the compressed data and identifies the pointer to the decompressed data (Step S206). The decompression unit 100b identifies the decompressed data on the basis of the pointer to the decompressed data (Step S207). Then, the decompression unit 100b proceeds to Step S208.
At Step S208, the decompression unit 100b writes the decompressed data in the write memory area (Step S208).
The decompression unit 100b determines whether there is compressed data to be processed in the read memory area (Step S209). If it is determined that there is compressed data to be processed in the read memory area (Step S209; Yes), the decompression unit 100b proceeds to Step S204 to process the subsequent compressed data.
Conversely, if it is determined that there is no compressed data to be processed in the read memory area (Step S209; No), the decompression unit 100b terminates the decompression operation and closes the compression file F2 (Step S210).
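Steps S204 to S209 can be sketched in the same manner. Several simplifications are assumed here and are not part of the embodiment: every code is treated as 3 bytes long, a plain mapping replaces the decompression nodeless tree, and the tag buffer is assumed to hold the tag "<medical_effect ...>" ahead of the tag stored at location 28.

```python
from typing import Dict, List, Tuple

CODE_LENGTH = 3  # assumption: every code in this sketch is 3 bytes long


def decompress(compressed: bytes,
               tag_buffer: str,
               address_table: Dict[int, Tuple[int, int]],
               static_words: Dict[int, str]) -> str:
    out: List[str] = []
    for i in range(0, len(compressed), CODE_LENGTH):
        chunk = compressed[i:i + CODE_LENGTH]
        code = int.from_bytes(chunk, "big")
        if chunk[0] >> 4 == 0xF:                      # Step S204: first four bits are "F"
            location, length = address_table[code]    # Step S205
            out.append(tag_buffer[location:location + length])
        else:
            # Steps S206 to S207: a mapping replaces the nodeless tree here.
            out.append(static_words[code])
    return "".join(out)                               # Step S208, in acquisition order


tag_buffer = '<medical_effect type="ac01"><side_effect type="bf03">'
address_table = {0xF80001: (28, 25)}   # values taken from the example
static_words = {0xC00010: "fever"}

restored = decompress(bytes.fromhex("F80001C00010"),
                      tag_buffer, address_table, static_words)
```

Because the decompressed data is appended in the order in which the codes are read, the tag and the original text return to their pre-conversion positions.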
As described above, the information processing apparatus 100 identifies a plurality of tag sections and a plurality of original text sections in input character data that includes a tag having a variable section. The information processing apparatus 100 converts each of a plurality of tags included in the plurality of tag sections into a plurality of first-type codes that respectively correspond to tag contents of the plurality of tags. The information processing apparatus 100 converts original texts in the plurality of original text sections into a plurality of second-type codes, each of the plurality of second-type codes being separated at least in boundaries between the plurality of tag sections and the plurality of original text sections in the input character data. The information processing apparatus 100 outputs encoded data including the plurality of first-type codes and the plurality of second-type codes, positional relationships of the plurality of tags and the original texts in the input character data being maintained with corresponding plurality of first-type codes and corresponding plurality of second-type codes in the encoded data. With this configuration, the information processing apparatus 100 converts a tag section into the first-type code, converts an original text into the second-type code, and outputs them in a state such that the pre-conversion positional relationship of the tag section and the original text is maintained; therefore, even if there is a tag that has a variable section, the compression rate of the input character data may be improved.
Specifically, some of the tags that are used in HTML, or the like, include a variable section, such as a reference source, and the others do not include any variable sections. Here, conventionally, as for character strings in documents, the frequencies of words or characters used in each document are different, and codes are assigned based on the frequencies of words or characters that are used in the document. With regard to original texts that are the character strings other than tags, the frequencies of words that are used in each document are different, and it is preferable to assign codes based on the frequencies of words that are used in the original text. As for tags, with regard to tags that have a variable section, there are few tags that are the same in their entirety, and there are limited types of tags that do not have any variable sections. Conventionally, compression is not conducted depending on the differences in the above-described characteristics with regard to documents that include a tag, especially, documents that include a tag having a variable section; therefore the compression rate is degraded. Conversely, the information processing apparatus 100 according to the first embodiment converts a tag section into the first-type code, converts an original text into the second-type code, and outputs them in a state such that the pre-conversion positional relationship of the tag section and the original text is maintained; therefore, even if there is a tag that has a variable section, the compression rate of the input character data may be improved.
Furthermore, the information processing apparatus 100 according to the first embodiment identifies a tag section and an original text in the input character data that includes a tag having a variable section. The information processing apparatus 100 converts the tag section and the original text into each code of a different type and outputs the converted code in a state such that the pre-conversion positional relationship of the tag section and the original text is maintained. However, this is not a limitation, and the information processing apparatus 100 may further conduct a search, in the compressed state, as to whether a search keyword is present in the original text that is surrounded by a tag of a specific tag type. In the case of documents related to drugs, for example, if the drug (drug efficacy) that is effective for “fever” needs to be searched with regard to the search keyword “fever”, a search is conducted as to whether the search keyword “fever” is present in the original text that is surrounded by the “drug efficacy” tag in a state such that the document is compressed.
Therefore, in the second embodiment, an explanation is given of a case where the information processing apparatus 100 conducts a search, in a compressed state, as to whether a search keyword is present in the original text that is surrounded by the tag of a specific tag type.
Flow of a Compression Operation
First, an explanation is given, with reference to
As is the case with
The information processing apparatus loads, into a memory area, the character data that is stored in the compression target file F1. The information processing apparatus extracts a character string from the beginning of the memory area and determines whether the extracted character string is a tag. For example, the information processing apparatus determines whether the starting character of the character string is the start symbol ‘<’ of the tag.
If the character string is a tag, the information processing apparatus determines the type (tag type) of the tag. For example, if the tag is “<medical_effect type=“ac01”>”, the information processing apparatus determines that the tag type is “drug efficacy” on the basis of “medical_effect” that is included in the tag. As another example, if the tag is “<side_effect type=“bf03”>”, the information processing apparatus determines that the tag type is “side effect” on the basis of “side_effect” that is included in the tag.
The information processing apparatus collectively stores the entire tag character string in the dynamic tag dictionary T10 and stores the location (storage location) where it is stored, the length (data length), and the tag type in the dynamic tag dictionary T10. Then, on the basis of the dynamic tag dictionary T10, the information processing apparatus compresses the tag character string into the compressed code that corresponds to the tag character string. Furthermore, an example of the data structure of the dynamic tag dictionary T10 is described later.
If the character string is not a tag, the information processing apparatus outputs the character string as the original text to the bit filter B0 and compresses the output character string into the compressed code (static code) that corresponds to the output character string on the basis of the bit filter B0. Here, the compression operation in a case where the character string is not a tag is the same as that in the first embodiment; therefore, the details are omitted.
The information processing apparatus outputs, to the compression file F2, each compressed code based on the dynamic tag dictionary T10 and each compressed code based on the bit filter B0 in a state such that the pre-conversion positional relationship of the tag or the original text with regard to each compressed code in the input character data is maintained.
Example of the Dynamic Tag Dictionary
Here, an explanation is given of a case where a compressed code is assigned to the character string “<side_effect type=“bf03”>” of the tag.
The information processing apparatus determines that the tag type is “side effect” on the basis of the character string “side_effect type” of the tag and acquires “88” that is related to “side_effect type”. The information processing apparatus collectively stores the character string “<side_effect type=“bf03”>” of the tag in the tag buffer T11. The information processing apparatus registers the storage location in which the character string of the tag is stored, the length of the stored data, and the tag type in the address table T12. Here, the information processing apparatus registers “28” as the storage location, “25” as the data length, and “88” as the tag type in the address table T12.
The information processing apparatus assigns, as a compressed code, the dynamic code that is in the address table T12 and that is related to the character string of the tag. Here, the information processing apparatus assigns, as a compressed code, the dynamic code “F80001h” that is related to the character string “<side_effect type=“bf03”>” of the tag.
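The relationship between the tag buffer T11, the address table T12, and the dynamic code can be modelled with a small sketch. The class and method names are hypothetical, and three points are assumptions rather than statements from the description: dynamic codes are allocated sequentially starting at F80000h, storage locations are counted in characters, and the tag type 87 for "medical_effect" is invented.

```python
class DynamicTagDictionary:
    """Toy model of T10: a tag buffer (T11) and an address table (T12)."""

    BASE_CODE = 0xF80000  # assumed code of the first registered tag

    def __init__(self):
        self.tag_buffer = ""     # T11: tag strings stored back to back
        self.address_table = []  # T12: one record per dynamic code

    def register(self, tag, tag_type):
        """Store a tag string and return the dynamic code assigned to it."""
        record = {
            "location": len(self.tag_buffer),  # storage location in T11
            "length": len(tag),                # data length
            "tag_type": tag_type,
        }
        self.tag_buffer += tag
        self.address_table.append(record)
        return self.BASE_CODE + len(self.address_table) - 1

    def decode(self, code):
        """Recover the tag string from a dynamic code."""
        record = self.address_table[code - self.BASE_CODE]
        start = record["location"]
        return self.tag_buffer[start:start + record["length"]]

    def codes_for_type(self, tag_type):
        """List the dynamic codes whose record has the given tag type."""
        return [
            self.BASE_CODE + index
            for index, record in enumerate(self.address_table)
            if record["tag_type"] == tag_type
        ]


t10 = DynamicTagDictionary()
first = t10.register('<medical_effect type="ac01">', 87)  # 87 is hypothetical
second = t10.register('<side_effect type="bf03">', 88)
print(hex(second), t10.address_table[1])
```

Under these assumptions, registering the 28-character tag first places the second tag at storage location 28 with data length 25 and dynamic code F80001h, which agrees with the values in the example above.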
Flow of a Search Operation
The information processing apparatus receives the search keyword and the tag type of the search tag. In the example of
On the basis of the tag type in the dynamic tag dictionary T10, the information processing apparatus identifies the dynamic code that corresponds to the tag type of the search tag. For example, the information processing apparatus identifies the hexadecimal value “F80001h” as the dynamic code that corresponds to the tag type “88” of the search tag in the address table T12 of the dynamic tag dictionary T10.
The information processing apparatus identifies the compressed code (static code) that corresponds to the search keyword on the basis of the bit filter B0 by using the original text as the search keyword. Furthermore, the compression operation of the search keyword is the same as the compression operation in a case where the character string is not a tag, and it is the same as that in the first embodiment; therefore, the details are omitted. Here, the static code of the search keyword “fever” is “A”.
With regard to the compression files F21 and F22, the information processing apparatus searches the appearance position of the dynamic code that corresponds to the tag type of the search tag and the appearance position of the compressed code that corresponds to the search keyword. In the example of
Conversely, in the compression file F22, the dynamic code “F80001” that corresponds to the tag type “88” of the search tag appears after the compressed code “A” that corresponds to the search keyword “fever”. Here, according to the second embodiment, the compressed codes are output to the compression files F21 and F22 in a state such that the pre-conversion positional relationship of the tag or the original text is maintained. Therefore, the information processing apparatus determines that the search keyword “fever” is not present in the original text that is surrounded by the tag “side_effect type=“bf03”” having the tag type “88” of the search tag and the tag “/side_effect type”.
The information processing apparatus outputs a search result. For example, if the search condition is matched, the information processing apparatus outputs “OK” as a search result. In addition, on the basis of the dynamic tag dictionary T10, the information processing apparatus outputs the character string that is obtained by decompressing the compressed portion that matches the search condition. In the example of
Configuration of the Information Processing Apparatus
The compression unit 200a is a processing unit that performs the compression operation that is illustrated in
Configuration of the Search Unit
The search-key receiving unit 201 receives search keys. For example, the search-key receiving unit 201 receives the search keyword and the tag type of the search tag as search keys.
The search-key position search unit 202 searches the compression file F2 for the positions that correspond to the search keys. For example, the search-key position search unit 202 uses the tag type in the address table T12 included in the dynamic tag dictionary T10 to identify the dynamic code that corresponds to the tag type of the search tag. The search-key position search unit 202 identifies the compressed code (static code) that corresponds to the search keyword on the basis of the bit filter B0 by using the original text as the search keyword. Then, the search-key position search unit 202 searches the compression file F2 for the appearance position of the dynamic code that corresponds to the tag type of the search tag and the appearance position of the compressed code that corresponds to the search keyword.
The search-condition matching determining unit 203 determines whether the appearance positions match the search condition. For example, the search condition is that the appearance position of the dynamic code that corresponds to the tag type of the search tag is immediately before the appearance position of the compressed code (static code) that corresponds to the search keyword. The search-condition matching determining unit 203 determines that the search condition is matched if the dynamic code appears immediately before the static code, and determines that the search condition is not matched otherwise.
The search-result output unit 204 outputs a search result. For example, if it is determined that the appearance position matches the search condition, the search-result output unit 204 outputs, as a search result, “OK” that indicates that the search condition is matched. In addition, the search-result output unit 204 outputs the character string that is obtained by decompressing the compressed portion at the appearance position that matches the search condition on the basis of the dynamic tag dictionary T10. Furthermore, the search-result output unit 204 may output the character string that is obtained by decompressing the beginning portion of the compression file F2 where the appearance position that matches the search condition is present. If it is determined that the search condition is not matched, the search-result output unit 204 outputs, as a search result, “NG” that indicates that the search condition is not matched.
Steps of an Operation of the Search Unit
Next, an explanation is given, with reference to
As illustrated in
The search unit 200b uses the tag type in the dynamic tag dictionary T10 to identify the dynamic code that corresponds to the tag type of the search tag (Step S304). For example, the search unit 200b acquires, from the address table T12 included in the dynamic tag dictionary T10, the record with the same tag type as that of the search tag. The search unit 200b identifies the dynamic code included in the acquired record.
The search unit 200b identifies the compressed code that corresponds to the search keyword from the static dictionary (Step S305). For example, the search unit 200b identifies the compressed code (static code) that corresponds to the search keyword on the basis of the bit filter B0 by using the original text as the search keyword.
Next, the search unit 200b searches the compression file F2 for the appearance position of the dynamic code and the appearance position of the compressed code (Step S306).
Then, the search unit 200b determines whether the appearance position matches the search condition (Step S307). For example, the search condition is such that the appearance position of the dynamic code that corresponds to the tag type of the search tag is immediately before the appearance position of the compressed code (static code) that corresponds to the search keyword. Then, by using the appearance position of the searched dynamic code and the appearance position of the searched compressed code, the search unit 200b determines whether the appearance position of the dynamic code is immediately before the appearance position of the compressed code.
If it is determined that the appearance position matches the search condition (Step S307; Yes), the search unit 200b outputs “OK” as a search result (Step S308). In addition, the search unit 200b outputs the character string that is obtained by decompressing the compressed portion at the appearance position that matches the search condition on the basis of the dynamic tag dictionary T10. Furthermore, the search unit 200b may output the character string that is obtained by decompressing the beginning portion of the compression file F2 where the appearance position that matches the search condition is present. Then, the search unit 200b terminates the search operation.
Conversely, if it is determined that the appearance position does not match the search condition (Step S307; No), the search unit 200b outputs “NG” as a search result (Step S309). Then, the search unit 200b terminates the search operation.
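Steps S304 to S309 can be sketched as below. This is an illustrative model only: the compression file is represented as a plain list of codes, the static dictionary as a Python dict standing in for the bit filter B0, and the address table as a list of records. The function names, the base code F80000h, and the record for tag type 87 are assumptions.

```python
def dynamic_codes_for(address_table, tag_type, base_code=0xF80000):
    """S304: collect the dynamic codes whose record has the given tag type."""
    return {
        base_code + index
        for index, record in enumerate(address_table)
        if record["tag_type"] == tag_type
    }


def search(compressed, address_table, static_dictionary, keyword, tag_type):
    """Return "OK" if a matching dynamic code immediately precedes the keyword code."""
    tag_codes = dynamic_codes_for(address_table, tag_type)
    keyword_code = static_dictionary.get(keyword)  # S305
    if keyword_code is None:
        return "NG"
    # S306 and S307: scan the appearance positions and test the condition.
    for position in range(1, len(compressed)):
        if (compressed[position] == keyword_code
                and compressed[position - 1] in tag_codes):
            return "OK"  # S308
    return "NG"  # S309


# Hypothetical data: the record for tag type 87 is invented.
address_table = [
    {"location": 0, "length": 28, "tag_type": 87},
    {"location": 28, "length": 25, "tag_type": 88},
]
static_dictionary = {"fever": "A"}
f21 = [0xF80001, "A"]  # dynamic code immediately before "A"
f22 = ["A", 0xF80001]  # dynamic code after "A"
print(search(f21, address_table, static_dictionary, "fever", 88))
print(search(f22, address_table, static_dictionary, "fever", 88))
```

The check compares positions in the compressed sequence directly, so nothing is decompressed before the match is decided.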
As described above, the information processing apparatus 200 identifies a tag section and an original text in the input character data that includes a tag having a variable section. In addition to the tag content, the information processing apparatus 200 registers, in the dynamic tag dictionary T10, the type attribute information that corresponds to the attribute of the tag in association with the first-type code and converts each tag included in the tag section into the first-type code that corresponds to the tag content. The information processing apparatus 200 converts an original text into the second-type code at least in units separated by tags in the input character data. The information processing apparatus 200 outputs each first-type code and each second-type code in a state such that the pre-conversion positional relationship of a tag or character string in the input character data with regard to each code is maintained. With this configuration, the information processing apparatus 200 may conduct a search in consideration of the tagging state with regard to the original text while compression is applied. Specifically, the information processing apparatus 200 stores, in the dynamic tag dictionary T10, the type attribute information on the tag in association with the first-type code in addition to the tag content. Therefore, by using the dynamic tag dictionary T10, the information processing apparatus 200 may determine whether the designated search keyword is present in the original text that is surrounded by the tag that has the type attribute information on a specific tag while compression is applied.
Furthermore, the information processing apparatus 200 according to the second embodiment separately encodes a tag section and the word of an original text and outputs them in a state such that the positional relationship in the original file F1 is maintained. As an example of implementation of the above-described encoding output, the information processing apparatus 200 may attach, to the code that is obtained by encoding the word of an original text, the dynamic code that is obtained by converting the tag that is attached to the word that corresponds to the code, and output it.
Therefore, in a third embodiment, an explanation is given of a case where the information processing apparatus 200 attaches, to the code that is obtained by encoding the word of an original text, the dynamic code that is obtained by converting the tag that is attached to the word that corresponds to the code, and outputs it.
Flow of the Compression Operation
First, an explanation is given, with reference to
As illustrated in
In the example of
The information processing apparatus 200 outputs each compressed code, which is obtained by encoding, to a memory area F2′ in a state such that the pre-conversion positional relationship of the tag or the original text with regard to each compressed code in the input character data is maintained.
As illustrated in
The information processing apparatus 200 attaches, to the compressed code that is obtained by encoding the character string of the original text, the dynamic code that is obtained by encoding the tag that is attached to the character string that corresponds to the compressed code and outputs it to the compression file F2. Specifically, during encoding of the original text, in addition to the code (static code) that corresponds to the character string (word) of the original text, the information processing apparatus 200 attaches the code (dynamic code) of the tag in the tagging in accordance with the word. In the example of
Thus, the information processing apparatus 200 attaches, to the compressed code that is obtained by encoding the word of the original text, the dynamic code that is obtained by encoding the tag that is attached to the word that corresponds to the compressed code, and outputs it to the compression file F2. As a result, a search may be conducted in consideration of the tagging state with regard to the original text while compression is applied. Specifically, the information processing apparatus 200 may determine whether the designated search keyword is present in the original text that is surrounded by the tag that has the tag type of the search tag while compression is applied.
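A minimal sketch of the attaching step of the third embodiment follows. The representation is assumed for illustration: the intermediate memory area F2′ is modelled as a list of `(kind, code)` pairs with kinds "open", "close", and "text", and the output is a list of `(static code, attached dynamic code)` pairs. Whether tag codes are also written to the compression file on their own, and how nested tags are handled, are not settled by the description; this sketch emits only word codes and uses the innermost open tag.

```python
def attach_tag_codes(stream):
    """Pair each word code with the dynamic code of its enclosing tag.

    stream: iterable of (kind, code), kind in {"open", "close", "text"}.
    Returns a list of (static_code, dynamic_code_or_None) in input order.
    """
    open_tags = []  # dynamic codes of tags that are currently open
    output = []
    for kind, code in stream:
        if kind == "open":
            open_tags.append(code)
        elif kind == "close":
            if open_tags:
                open_tags.pop()
        else:
            attached = open_tags[-1] if open_tags else None
            output.append((code, attached))
    return output


# Hypothetical codes: 0xF80002 for the closing tag and "B" for a second word.
f2_prime = [
    ("open", 0xF80001),
    ("text", "A"),
    ("close", 0xF80002),
    ("text", "B"),
]
print(attach_tag_codes(f2_prime))
```

Here the word code "A" carries the dynamic code of the tag that surrounds it, while "B", which lies outside any tag, carries none.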
Flow of the Search Operation
The information processing apparatus 200 receives the search keyword and the tag type of the search tag. In the example of
The information processing apparatus 200 uses the tag type in the dynamic tag dictionary T10 to identify the dynamic code that corresponds to the tag type of the search tag. For example, the information processing apparatus 200 identifies the hex number “F80001h” as the dynamic code that corresponds to the tag type “88” of the search tag in the address table T12 of the dynamic tag dictionary T10.
The information processing apparatus 200 identifies the compressed code (static code) that corresponds to the search keyword on the basis of the bit filter B0 by using the original text as the search keyword. Furthermore, the compression operation of the search keyword is the same as the compression operation in a case where the character string is not a tag, and it is the same as that in the first embodiment; therefore, the details are omitted. Here, the static code of the search keyword “fever” is “A”.
With regard to the compression files F21 and F22, the information processing apparatus 200 searches the appearance position of the dynamic code that corresponds to the tag type of the search tag and the appearance position of the compressed code that corresponds to the search keyword. In the example of
Under this condition, in the compression file F21, the dynamic code “F80001” that corresponds to the tag type “88” of the search tag is attached to the static code “A” of the search keyword “fever”. Therefore, the information processing apparatus 200 determines that the search keyword “fever” is present in the original text that is surrounded by the tag “side_effect type=“bf03”” having the tag type “88” of the search tag and the tag “/side_effect type” in the compressed state.
Conversely, in the compression file F22, the dynamic code “F80001” that corresponds to the tag type “88” of the search tag is not attached to the static code “A” of the search keyword “fever”. Therefore, the information processing apparatus 200 determines that the search keyword “fever” is not present in the original text that is surrounded by the tag “side_effect type=“bf03”” having the tag type “88” of the search tag and the tag “/side_effect type” in a compressed state.
The information processing apparatus 200 outputs a search result. In the example of
Thus, the information processing apparatus 200 may conduct a search in consideration of the tagging state with regard to the original text while compression is applied. Specifically, the information processing apparatus 200 may determine whether the designated search keyword is present in the original text that is surrounded by the tag that has the tag type of the search tag while compression is applied.
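The search condition of the third embodiment, namely that the dynamic code of the search tag is attached to the static code of the search keyword, can be sketched as follows. The pair representation of the compression file and the function name are assumptions made for the example.

```python
def search_attached(entries, keyword_code, tag_codes):
    """Return "OK" if any entry pairs the keyword code with a wanted tag code.

    entries: list of (static_code, attached_dynamic_code_or_None).
    tag_codes: set of dynamic codes that have the tag type of the search tag.
    """
    for static_code, dynamic_code in entries:
        if static_code == keyword_code and dynamic_code in tag_codes:
            return "OK"
    return "NG"


tag_codes = {0xF80001}    # dynamic code for tag type "88" in the example
f21 = [("A", 0xF80001)]   # "fever" inside the side-effect tag
f22 = [("A", None)]       # "fever" outside the tag
print(search_attached(f21, "A", tag_codes))
print(search_attached(f22, "A", tag_codes))
```

Compared with a position-based check, this test looks at each entry in isolation, so the match does not depend on what precedes the keyword code in the file.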
Another Aspect Related to the First Embodiment to the Third Embodiment
Hereafter, part of a modified example of the above-described embodiments is explained. In addition to the modified example that is described below, design changes may be made as appropriate without departing from the scope of the present invention.
Furthermore, according to the first embodiment to the third embodiment, the information processing apparatus 100, 200 identifies a tag section and an original text in the input character data that includes a tag having a variable section and converts each of the tag section and the original text into a code of different type. Then, the information processing apparatus 100, 200 outputs the converted code in a state such that the pre-conversion positional relationship of the tag section and the original text is maintained. However, the information processing apparatus 100, 200 may perform the same compression operation on not only a tag that has a variable section but also a file name including a path or a mail address. Specifically, the information processing apparatus 100, 200 identifies a path section and a section other than the path section in the input character data that includes a file name including a path and converts each of the path section and the section other than the path section into a code of a different type. Then, the information processing apparatus 100, 200 may output the converted code in a state such that the pre-conversion positional relationship of the path section and the section other than the path section is maintained. Furthermore, the information processing apparatus 100, 200 identifies a mail address section and a section other than the mail address section in the input character data that includes the mail address and converts each of the mail address section and the section other than the mail address section into a code of a different type. Then, the information processing apparatus 100, 200 may output the converted code in a state such that the pre-conversion positional relationship of the mail address section and the section other than the mail address section is maintained. 
Thus, the information processing apparatus 100, 200 may improve the compression rate of the input character data even if there is a path section or a mail address section in addition to a tag that has a variable section.
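Identifying path sections and mail address sections while keeping their positional relationship to the rest of the text can be sketched in the same spirit. The regular expressions below are deliberately simplified assumptions (absolute slash-separated paths and a common mail address shape) and are not taken from the description.

```python
import re

# Simplified, hypothetical patterns; real paths and addresses are more varied.
MAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
PATH = r"(?:/[\w.-]+)+"
SECTION = re.compile(rf"(?P<mail>{MAIL})|(?P<path>{PATH})")


def split_sections(data):
    """Split data into ("mail" | "path" | "other", text) pieces in order."""
    sections = []
    position = 0
    for match in SECTION.finditer(data):
        if match.start() > position:
            sections.append(("other", data[position:match.start()]))
        sections.append((match.lastgroup, match.group()))
        position = match.end()
    if position < len(data):
        sections.append(("other", data[position:]))
    return sections


sample = "send /var/log/app.log to admin@example.com now"
for kind, text in split_sections(sample):
    print(kind, repr(text))
```

Concatenating the pieces in order reproduces the input, so each kind of section could be converted into its own type of code without losing the original arrangement.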
Furthermore, the steps of the operation, control procedures, specific names, and information including various types of data or parameters, which are described in the first embodiment to the third embodiment, may be changed as appropriate if not otherwise specified.
Hardware Configuration of the Information Processing Apparatus
The hard disk device 408 stores a character-data conversion program that has the same functionality as each processing unit, such as the tag determining unit 102, the tag encode unit 103, the text encode unit 104, and the update unit 105, which are illustrated in
The CPU 401 reads each program that is stored in the hard disk device 408, loads it into the RAM 407, and executes it, thereby performing various operations. The programs may cause the computer 400 to serve as, for example, the tag determining unit 102, the tag encode unit 103, the text encode unit 104, and the update unit 105, which are illustrated in
Furthermore, the above-described character-data conversion program does not always need to be stored in the hard disk device 408. For example, the program may be stored in a storage medium readable by the computer 400, and the computer 400 may read the program from the storage medium and execute it. The storage medium readable by the computer 400 is, for example, a portable recording medium, such as a CD-ROM, DVD, or universal serial bus (USB) memory, a semiconductor memory, such as a flash memory, or a hard disk drive. Furthermore, the program may be stored in a device that is connected to a public network, the Internet, a local area network (LAN), or the like, and the computer 400 may read the program from the device and execute it.
If a compression command is received by the CPU 401, processing is performed based on at least part of the middleware 28 or the application program 29, and the compression function of the compression unit 100a is implemented (the hardware group 26 is controlled based on the OS 27 for the above processing). The compression function may be included in the application program 29, or it may be part of the middleware 28 that is executed when it is invoked in accordance with the application program 29.
According to an aspect, even if compression is conducted on a document that includes a tag and an original text in a mixed manner, the compression rate of the original text may be improved. Furthermore, while compression is applied to a document that includes a tag and an original text in a mixed manner, a character string may be searched in the original text in consideration of the tagging state.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2015-014399 | Jan 2015 | JP | national |