This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-119271, filed on Jun. 19, 2017, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a non-transitory computer-readable storage medium, an extraction method, and an extraction device.
Various types of content are made publicly available on web sites, and those pieces of content include, for example, content that is not viewed by users, such as information regarding obsolete technologies. It is desired that such content that is not viewed be deleted during maintenance of the web sites. As a content evaluation method, there has been proposed, for example, a method in which moving average values of the numbers of accesses are calculated based on an access log for the content, and whether or not the usefulness of the content is continuing is determined based on the transition of the moving average values. Also, there has been proposed a technology for extracting main content from web documents and extracting well-known or popular keywords from the extracted main content.
Related technologies are disclosed in Japanese Laid-open Patent Publication No. 2011-154487 and Japanese Laid-open Patent Publication No. 2010-204866.
According to an aspect of the invention, a non-transitory computer-readable storage medium stores a program that causes a computer to execute a process, the process including obtaining reference counts that are numbers of times respective pieces of content were referred to, classifying the pieces of content into a plurality of groups based on the reference counts, selecting one or more feature phrases from each of the pieces of content based on appearance frequencies of words included in each of the pieces of content, and extracting first content that includes a feature phrase which is included in all of the plurality of groups, wherein the feature phrase is any one of the one or more feature phrases selected by the selecting.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
When content that is not viewed is deleted and, for example, content with a small number of accesses is simply selected as content to be deleted, there are cases in which content that is likely to be referred to in the future is deleted merely because the number of accesses thereto is small. Thus, it is desired that content that is likely to be referred to in the future be extracted in advance so that the content is not deleted. However, it is difficult for an administrator of a web site to check and extract each piece of content individually, since doing so takes large amounts of time and effort.
An object of one aspect is to provide an extraction program, an extraction method, and an extraction device that make it possible to extract content that is likely to be referred to in the future, even if the number of references (which may be referred to hereinafter as a “reference count”) to the content is small.
An extraction program, an extraction method, and an extraction device according to an embodiment disclosed herein will be described below in detail with reference to the accompanying drawings. The present embodiment is not intended to limit the disclosed technology. What is disclosed in the embodiment described below may appropriately be combined as long as such a combination does not cause contradiction.
Each web server 10 is, for example, an information processing apparatus for operating a web site (also referred to hereinafter as a “site”) for providing information about a group of products to customers, service personnel, and so on. Each web server 10 has pieces of content in the site. Examples of the pieces of content include web pages written in the HyperText Markup Language (HTML). Also, an access log including the numbers of accesses (which are also referred to hereinafter as “reference counts”), access dates and times, and so on for the respective pieces of content is recorded in each web server 10. Based on deletion information received from the extraction device 100, each web server 10 also deletes the content corresponding to the deletion information. Although an example in which one web server 10 provides one site will be described in the present embodiment, the present disclosure is not limited thereto, and one web server 10 may provide a plurality of sites.
The extraction device 100 obtains the reference counts for the respective pieces of content from each web server 10 through the network N, each reference count being the number of times each piece of content was referred to. Based on the reference counts, the extraction device 100 classifies the pieces of content into a plurality of groups. The extraction device 100 extracts main phrases in the content from each of the groups, the main phrases being based on appearance frequencies of words included in the content. The extraction device 100 extracts the content including a main phrase that appears in all of the groups. Thus, the extraction device 100 can extract content that is likely to be referred to in the future, even if the reference count of the content is small.
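The flow described above can be sketched as follows. This is a minimal illustration only, not the implementation of the extraction device 100: the function names, the use of ascending count boundaries to form the groups, the choice to ignore empty groups, and the sample data are all assumptions made for the example, and the main phrases of each piece of content are taken as already extracted.

```python
from bisect import bisect_right


def classify_by_reference_count(reference_counts, boundaries):
    """Place each content ID into a group chosen by ascending count boundaries.

    The boundary values are an assumption; the text only says that the
    classification is based on the reference counts.
    """
    groups = [set() for _ in range(len(boundaries) + 1)]
    for content_id, count in reference_counts.items():
        groups[bisect_right(boundaries, count)].add(content_id)
    return groups


def phrases_in_all_groups(groups, phrases_by_content):
    """Return the main phrases that appear in every non-empty group."""
    per_group = [
        set().union(*(phrases_by_content[c] for c in group))
        for group in groups
        if group  # assumption: an empty group places no constraint
    ]
    return set.intersection(*per_group) if per_group else set()


def extract_content(reference_counts, phrases_by_content, boundaries):
    """Extract the content that includes a phrase common to all groups."""
    groups = classify_by_reference_count(reference_counts, boundaries)
    common = phrases_in_all_groups(groups, phrases_by_content)
    return sorted(c for c, p in phrases_by_content.items() if p & common)


# Hypothetical sample data: "b.html" has few references but shares the
# phrase "server" with frequently referenced content.
counts = {"a.html": 120, "b.html": 3, "c.html": 95, "d.html": 1}
phrases = {
    "a.html": {"server", "cloud"},
    "b.html": {"server", "floppy disk"},
    "c.html": {"cloud"},
    "d.html": {"floppy disk"},
}
extracted = extract_content(counts, phrases, boundaries=[50])
# extracted == ["a.html", "b.html"]
```

In this sample, "b.html" is extracted even though its reference count is small, because it includes a phrase that appears in both the low-count group and the high-count group.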
The configuration of the extraction device 100 will be described next. As illustrated in
The communication unit 110 is implemented by, for example, a network interface card (NIC) or the like. The communication unit 110 serves as a communication interface that is connected to the web servers 10 through the network N in a wired or wireless manner and is responsible for communicating information with the web servers 10. The communication unit 110 outputs the access log, received from each web server 10, to the control unit 130. The communication unit 110 also transmits deletion information, input from the control unit 130, to the corresponding web server 10.
The storage unit 120 is implemented by, for example, a semiconductor memory device, such as a random-access memory (RAM) or a flash memory, or a storage device for a hard disk, an optical disk, or the like. The storage unit 120 includes a keyphrase storage section 121, an undefined-keyphrase storage section 122, a user-dictionary storage section 123, a deletion-candidate storage section 124, and a condition storage section 125. Information used for processing in the control unit 130 is stored in the storage unit 120.
Keyphrases extracted from keyphrase extraction source content are classified according to appearance frequencies of the keyphrases in the content and are stored in the keyphrase storage section 121. Each keyphrase is a main phrase in the content and includes a keyword. Each keyphrase is made of, for example, words comprising only nouns, a phrase including a plurality of nouns, or a phrase comprising a combination of an adjective and a noun.
The “obsolete” is information indicating, of the keyphrases extracted from each of the pieces of content classified into the two groups according to the numbers of accesses, a keyphrase that appears in the group in which the number of accesses is small. The “universal” is information indicating, of the keyphrases extracted from each of the pieces of content classified into the two groups according to the numbers of accesses, a keyphrase that appears in both of the groups. The “trend” is information indicating, of the keyphrases extracted from each of the pieces of content classified into the two groups according to the numbers of accesses, a keyphrase that appears in the group in which the number of accesses is large.
In the example of content A-1 in
Referring to
The “No.” is an identifier for identifying an undefined keyphrase. The “detection date” is information indicating a date when the undefined keyphrase was detected for the first time during evaluation of to-be-evaluated content. The “detection content” is information indicating content from which the undefined keyphrase was detected. The “undefined keyphrase” is information indicating a keyphrase that is included in keyphrases extracted from to-be-evaluated content, that is not classified into any of the “obsolete”, “universal”, and “trend”, and that does not exist in a user dictionary. The “status” is information indicating a status of the undefined keyphrase. In the “status”, for example, “WAIT” indicates an on-hold state, and “DEL” indicates a state in which the content including the corresponding undefined keyphrase was deleted. The example in the first row illustrated in
Referring back to
Referring back to
Referring back to
The “user dictionary” is information indicating a threshold for an appearance rate of keyphrases registered in the user dictionary relative to all keyphrases in the to-be-evaluated content. The appearance rate of keyphrases is a keyphrase appearance frequency expressed in percentage. The “obsolete keyphrase” is information indicating a threshold for the appearance rate of obsolete keyphrases relative to all keyphrases in the to-be-evaluated content. The “universal keyphrase” is information indicating a threshold for the appearance rate of universal keyphrases relative to all keyphrases in the to-be-evaluated content. The “trend keyphrase” is information indicating a threshold for the appearance rate of trend keyphrases relative to all keyphrases in the to-be-evaluated content. The “number of days elapsed from last update date” is information indicating a threshold for the number of days elapsed from the last update date of the to-be-evaluated content. The “number of days elapsed from last update date” may be, for example, 30 days.
For example, a central processing unit (CPU) or a micro processing unit (MPU) executes a program stored in an internal storage device by using a random-access memory (RAM) as a work area, to thereby realize the control unit 130. The control unit 130 may also be realized by, for example, an integrated circuit, such as an application-specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
The control unit 130 includes an obtainment unit 131, a first classifier 132, a first extractor 133, a second classifier 134, a second extractor 135, and an updater 136 and realizes or executes functions and effects of information processing described below. That is, the processing units in the control unit 130 execute extraction processing. The extraction processing is executed, for example, at predetermined intervals, such as every month, every three months, every half a year, or every year. The internal configuration of the control unit 130 is not limited to the configuration illustrated in
For example, when an administrator of the web server 10 gives an instruction for evaluating pieces of content in a site by using a terminal apparatus (not illustrated), the obtainment unit 131 sets to-be-evaluated content and keyphrase extraction source content (which may be referred to hereinafter as “extraction source content”). The obtainment unit 131 obtains the to-be-evaluated content and the extraction source content from the corresponding web server 10 via the communication unit 110 and the network N. Also, the obtainment unit 131 obtains an access log of the set extraction source content from the corresponding web server 10 via the communication unit 110 and the network N. That is, the obtainment unit 131 obtains reference counts, which are the numbers of times the respective pieces of content were referred to. The obtainment unit 131 outputs the obtained extraction source content and the obtained access log of the extraction source content to the first classifier 132. The obtainment unit 131 also outputs the obtained to-be-evaluated content to the first extractor 133.
Now, a relationship between to-be-evaluated content and keyphrase extraction sources will be described with reference to
In the example illustrated in
Also, by not designating a site including to-be-evaluated content as a keyphrase extraction source, the obtainment unit 131 can perform more objective evaluation. That is, since similar keyphrases are thought to be highly likely to be scattered throughout the content in the same site, extracting keyphrases from content in other sites and evaluating the to-be-evaluated content against the extracted keyphrases makes it possible to perform more objective evaluation.
Referring back to
That is, based on the reference counts, the first classifier 132 classifies pieces of content into a plurality of groups. The first classifier 132 also classifies pieces of content (extraction source content) different from the to-be-evaluated content into a plurality of groups.
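A minimal sketch of the two-group classification follows. The text does not state the criterion used to separate the groups, so the threshold parameter, its default (the median of the counts), and the function name are assumptions. The first group is taken here to be the group with the smaller reference counts, consistent with the first main phrases being the obsolete keyphrases.

```python
from statistics import median


def split_by_reference_count(reference_counts, threshold=None):
    """Split content IDs into (first_group, second_group).

    first_group:  content whose reference count is below the threshold.
    second_group: content whose reference count is at or above it.
    The median default is an assumption made for this sketch.
    """
    if not reference_counts:
        return set(), set()
    if threshold is None:
        threshold = median(reference_counts.values())
    first_group = {c for c, n in reference_counts.items() if n < threshold}
    second_group = {c for c, n in reference_counts.items() if n >= threshold}
    return first_group, second_group


# Hypothetical access counts for extraction source content in sites B and C.
source_counts = {"B-1.html": 5, "B-2.html": 200, "C-1.html": 40, "C-2.html": 90}
first_group, second_group = split_by_reference_count(source_counts)
```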
Upon input of the extraction source content and the classification information from the first classifier 132, the first extractor 133 extracts keyphrases for each of the classified groups, that is, for each of the first and second groups. In other words, the first extractor 133 extracts main phrases (keyphrases) in the content from each of the groups, the keyphrases being based on the appearance frequencies of words included in the content. The first extractor 133 outputs the extracted keyphrases for each of the groups to the second classifier 134.
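As a rough illustration of frequency-based extraction, the sketch below counts words in plain text and keeps the most frequent ones. It is a deliberate simplification: it handles single words only and does no part-of-speech analysis, whereas the keyphrases in the embodiment are noun-based phrases. The stopword list, the `top_n` and `min_count` parameters, and the function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real extractor would rely on
# part-of-speech analysis instead.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}
)


def extract_keyphrases(text, top_n=3, min_count=2):
    """Return up to top_n words that occur at least min_count times."""
    words = [
        w
        for w in re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
        if w not in STOPWORDS
    ]
    counts = Counter(words)
    return [w for w, n in counts.most_common(top_n) if n >= min_count]


def keyphrases_for_group(texts):
    """Collect the keyphrases of every piece of content in one group."""
    group_phrases = set()
    for text in texts:
        group_phrases.update(extract_keyphrases(text))
    return group_phrases


sample = (
    "The server manual describes the server setup. "
    "The setup of the server uses a floppy disk."
)
sample_keyphrases = extract_keyphrases(sample)
# sample_keyphrases == ["server", "setup"]
```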
Upon input of the to-be-evaluated content from the obtainment unit 131, the first extractor 133 extracts keyphrases from the to-be-evaluated content. The first extractor 133 outputs the extracted keyphrases of the to-be-evaluated content to the second classifier 134. The first extractor 133 also refers to a timestamp of the to-be-evaluated content to obtain the last update date and time of the to-be-evaluated content. The first extractor 133 outputs the obtained last update date and time to the second extractor 135. The timestamp of the to-be-evaluated content is, for example, information indicating creation date and time, last update date and time, or the like held in a file system of an operating system (OS).
Now, extraction of keyphrases will be described with reference to
Referring back to
That is, the second classifier 134 classifies the main phrases extracted from each of the groups into the first main phrases that appear in only the first group, the second main phrases that appear in both of the first and second groups, and the third main phrases that appear in only the second group.
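This three-way classification corresponds to simple set operations, as the following sketch shows. The function name and the sample phrases are hypothetical; the first group is assumed to be the group with the smaller numbers of accesses.

```python
def classify_group_keyphrases(first_group_phrases, second_group_phrases):
    """Classify phrases by the group or groups in which they appear."""
    first = set(first_group_phrases)
    second = set(second_group_phrases)
    return {
        "obsolete": first - second,   # appear in only the first group
        "universal": first & second,  # appear in both groups
        "trend": second - first,      # appear in only the second group
    }


# Hypothetical phrases extracted from each group.
stored = classify_group_keyphrases(
    {"floppy disk", "server"},
    {"server", "cloud"},
)
```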
Upon input of keyphrases of the to-be-evaluated content from the first extractor 133, the second classifier 134 refers to the keyphrase storage section 121 and the user-dictionary storage section 123 to classify the input keyphrases. That is, the second classifier 134 classifies the keyphrases extracted from the to-be-evaluated content into the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases.
The second classifier 134 classifies a keyphrase included in the keyphrases extracted from the to-be-evaluated content and registered in the user dictionary into the user dictionary keyphrases. The second classifier 134 classifies a keyphrase that is included in the keyphrases extracted from the to-be-evaluated content and that matches a keyphrase in the “obsolete” field in the keyphrase storage section 121 into the obsolete keyphrases. The second classifier 134 classifies a keyphrase that is included in the keyphrases extracted from the to-be-evaluated content and that matches a keyphrase in the “universal” field in the keyphrase storage section 121 into the universal keyphrases. The second classifier 134 classifies a keyphrase that is included in the keyphrases extracted from the to-be-evaluated content and that matches a keyphrase in the “trend” field in the keyphrase storage section 121 into the trend keyphrases. The second classifier 134 classifies a keyphrase that is included in the keyphrases extracted from the to-be-evaluated content and that has not been classified into any of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, and the trend keyphrases into the undefined keyphrases.
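The five-way classification of the keyphrases of to-be-evaluated content can be sketched as a sequence of membership tests. The order of precedence used here (user dictionary first, then obsolete, universal, and trend) follows the order of the description and is an assumption, as are the function name, the dictionary layout, and the sample data.

```python
CATEGORIES = ("user_dictionary", "obsolete", "universal", "trend", "undefined")


def classify_evaluated_keyphrases(keyphrases, user_dictionary, stored):
    """Assign each keyphrase of the to-be-evaluated content to one category.

    stored maps "obsolete", "universal", and "trend" to sets of phrases,
    standing in for the keyphrase storage section 121.
    """
    classified = {name: [] for name in CATEGORIES}
    for phrase in keyphrases:
        if phrase in user_dictionary:
            classified["user_dictionary"].append(phrase)
        elif phrase in stored["obsolete"]:
            classified["obsolete"].append(phrase)
        elif phrase in stored["universal"]:
            classified["universal"].append(phrase)
        elif phrase in stored["trend"]:
            classified["trend"].append(phrase)
        else:
            classified["undefined"].append(phrase)
    return classified


# Hypothetical inputs.
result = classify_evaluated_keyphrases(
    ["server", "floppy disk", "cloud", "FM-8", "warranty"],
    user_dictionary={"warranty"},
    stored={"obsolete": {"floppy disk"}, "universal": {"server"}, "trend": {"cloud"}},
)
```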
That is, since undefined keyphrases are keyphrases that do not exist in the keyphrase extraction source content and the user dictionary, it is difficult to directly use the undefined keyphrases for evaluation. Accordingly, based on whether or not a classified undefined keyphrase was also classified into the undefined keyphrases during past evaluation of to-be-evaluated content, the second classifier 134 executes undefined keyphrase processing for determining whether the classified undefined keyphrase is an obsolete keyphrase or a trend keyphrase.
The second classifier 134 determines whether or not an undefined keyphrase exists in the classified keyphrases. When an undefined keyphrase does not exist, the second classifier 134 ends the undefined keyphrase processing. When an undefined keyphrase exists, the second classifier 134 refers to the undefined-keyphrase storage section 122 to check whether or not each undefined keyphrase has appeared in the past.
When the checked undefined keyphrase has appeared in the past, the second classifier 134 classifies the undefined keyphrase into the obsolete keyphrases. When the checked undefined keyphrase has not appeared in the past, the second classifier 134 stores the undefined keyphrase in the undefined-keyphrase storage section 122. When the processing based on whether or not there is an occurrence in the past is completed for all of the undefined keyphrases, the second classifier 134 ends the undefined keyphrase processing. Upon completing the undefined keyphrase processing, the second classifier 134 outputs the classified keyphrases of the to-be-evaluated content to the second extractor 135.
That is, when the checked undefined keyphrase has appeared in the past, the undefined keyphrase is a keyphrase that has not been used in other sites (sites B and C), which are evaluation references, from a past evaluation time to the present time, and thus the second classifier 134 classifies the undefined keyphrase into the obsolete keyphrases. That is, the undefined keyphrase is a keyphrase that was not used (added) in the keyphrase extraction source content. In contrast, when the undefined keyphrase is a trend keyphrase, it is highly likely to be added in the other sites, and in that case, the undefined keyphrase is classified into the trend keyphrases. On the other hand, when the checked undefined keyphrase has not appeared in the past, the undefined keyphrase is a keyphrase that does not exist in the other sites (sites B and C), which are evaluation references, and thus, the undefined keyphrase is thought to be either considerably obsolete or considerably trendy. Thus, the second classifier 134 puts the undefined keyphrase on hold until next evaluation in order to check a future trend and stores the undefined keyphrase in the undefined-keyphrase storage section 122.
In other words, the second classifier 134 stores, in the undefined-keyphrase storage section 122, a fourth main phrase (an undefined keyphrase) that is included in main phrases extracted from to-be-evaluated content and that is a main phrase not corresponding to any of the first main phrases, the second main phrases, and the third main phrases. During next content extraction, when a fourth main phrase extracted from the to-be-evaluated content matches any of the fourth main phrases stored in the undefined-keyphrase storage section 122, the second classifier 134 classifies the extracted fourth main phrase into the first main phrases (obsolete keyphrases).
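The undefined keyphrase processing can be sketched as follows. The undefined-keyphrase storage section 122 is modeled as a dictionary keyed by phrase; this layout, the function name, and the dates in the usage are assumptions made for illustration.

```python
from datetime import date


def process_undefined_keyphrases(classified, undefined_store, content_id, today):
    """Reclassify or put on hold each undefined keyphrase.

    A phrase already present in undefined_store (seen in a past evaluation)
    is moved to the obsolete keyphrases. A phrase seen for the first time is
    stored with status "WAIT" and stays undefined for this evaluation.
    """
    still_undefined = []
    for phrase in classified["undefined"]:
        if phrase in undefined_store:
            classified["obsolete"].append(phrase)
        else:
            undefined_store[phrase] = {
                "detection_date": today,
                "detection_content": content_id,
                "status": "WAIT",
            }
            still_undefined.append(phrase)
    classified["undefined"] = still_undefined
    return classified


store = {}
content = "/manual/computer/fm-8/fm-8.html"

# First evaluation: "FM-8" is new, so it is put on hold.
first = process_undefined_keyphrases(
    {"obsolete": [], "undefined": ["FM-8"]}, store, content, date(2016, 1, 10)
)

# Next evaluation: "FM-8" has appeared before, so it becomes obsolete.
second = process_undefined_keyphrases(
    {"obsolete": [], "undefined": ["FM-8"]}, store, content, date(2016, 4, 10)
)
```

Because a phrase is held only once before being reclassified, the same phrase is not classified into the undefined keyphrases indefinitely.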
Now, a description will be given of transition from when what is stored in the undefined-keyphrase storage section 122 changes from an empty state to the state of the undefined-keyphrase storage section 122 illustrated in
Subsequently, when other content including the undefined keyphrase “FM-8” does not appear during next scanning, the undefined keyphrase “FM-8” is classified into the obsolete keyphrases. However, for example, when a user dictionary keyphrase is included in the content “/manual/computer/fm-8/fm-8.html”, and the content does not become content to be deleted, the contents of the undefined-keyphrase storage section 122 do not change. On the other hand, when the content becomes content to be deleted, the status for the undefined keyphrase “FM-8” in the undefined-keyphrase storage section 122 is updated from “WAIT” to “DEL” (that is, the state in the first row in
When other content including the undefined keyphrase “Windows 2016” appears during next scanning, the undefined keyphrase is stored in the keyphrase storage section 121 as a trend keyphrase. In this case, although the record of the undefined keyphrase “Windows 2016” in the undefined-keyphrase storage section 122 is deleted, the record may be kept unchanged and then be deleted during maintenance. The maintenance is performed, for example, when the number of records becomes enormous, and a record in which the status is “DEL” and a record in which the status is “WAIT” and a predetermined number of days (for example, 365 days) has passed may be deleted.
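The maintenance of the undefined-keyphrase records can be sketched as a filter. Measuring the predetermined number of days from the detection date is an assumption, as are the record layout, the function name, and the sample dates.

```python
from datetime import date


def purge_undefined_records(records, today, max_wait_days=365):
    """Drop "DEL" records and "WAIT" records older than max_wait_days."""
    kept = []
    for record in records:
        if record["status"] == "DEL":
            continue
        age_days = (today - record["detection_date"]).days
        if record["status"] == "WAIT" and age_days >= max_wait_days:
            continue
        kept.append(record)
    return kept


# Hypothetical records.
records = [
    {"phrase": "FM-8", "detection_date": date(2017, 1, 10), "status": "DEL"},
    {"phrase": "old-term", "detection_date": date(2016, 1, 10), "status": "WAIT"},
    {"phrase": "new-term", "detection_date": date(2017, 5, 1), "status": "WAIT"},
]
remaining = purge_undefined_records(records, today=date(2017, 6, 19))
```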
When each classified keyphrase of the to-be-evaluated content is input from the second classifier 134, the second extractor 135 evaluates the to-be-evaluated content. That is, based on the classified keyphrases of the to-be-evaluated content, the second extractor 135 calculates appearance frequencies, that is, appearance rates, of the keyphrases in the to-be-evaluated content for respective classifications of the keyphrases. Specifically, the second extractor 135 calculates an appearance rate of user dictionary keyphrases, an appearance rate of obsolete keyphrases, an appearance rate of universal keyphrases, and an appearance rate of trend keyphrases, each relative to all keyphrases extracted from the to-be-evaluated content.
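The appearance rate of each classification is the number of keyphrases in that classification divided by the number of all keyphrases extracted from the to-be-evaluated content, expressed as a percentage. A minimal sketch follows, assuming the classified keyphrases are held in a dictionary of lists; the layout, the function name, and the sample phrases are hypothetical.

```python
RATED_CATEGORIES = ("user_dictionary", "obsolete", "universal", "trend")


def appearance_rates(classified):
    """Return the percentage of all keyphrases that falls in each category.

    The denominator counts every extracted keyphrase, including any that
    remain undefined.
    """
    total = sum(len(phrases) for phrases in classified.values())
    if total == 0:
        return {name: 0.0 for name in RATED_CATEGORIES}
    return {
        name: 100.0 * len(classified.get(name, [])) / total
        for name in RATED_CATEGORIES
    }


# Hypothetical classification of ten keyphrases.
rates = appearance_rates(
    {
        "user_dictionary": ["warranty"],
        "obsolete": ["floppy disk", "FM-8", "tape", "modem", "pager"],
        "universal": ["server", "manual"],
        "trend": ["cloud", "tablet"],
        "undefined": [],
    }
)
```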
By referring to the condition storage section 125, the second extractor 135 determines whether or not the to-be-evaluated content satisfies the deletion conditions, based on the calculated appearance rates of the classified keyphrases. When the to-be-evaluated content does not satisfy the deletion conditions, the second extractor 135 extracts the to-be-evaluated content as content to be maintained. The second extractor 135 generates update information including trend keyphrases of the to-be-evaluated content and outputs the update information to the updater 136.
When the to-be-evaluated content satisfies the deletion conditions, the second extractor 135 sets the to-be-evaluated content as deletion candidate content and stores the identifier of the set deletion candidate content in the deletion-candidate storage section 124.
Subsequently, the second extractor 135 executes deletion processing. The second extractor 135 determines whether or not a predetermined number of days has elapsed from the last update date, based on the last update date and time of the to-be-evaluated content input from the first extractor 133 and a condition regarding update of the condition storage section 125. That is, the predetermined number of days is the number of days in the “number of days elapsed from last update date” field in the condition storage section 125.
Upon determining that the predetermined number of days has elapsed from the last update date, the second extractor 135 refers to the deletion-candidate storage section 124 to generate deletion information based on the identifier of the deletion candidate content. The second extractor 135 transmits the generated deletion information to the corresponding web server 10 via the communication unit 110 and the network N. Upon transmitting the deletion information, the second extractor 135 generates update information for updating the deletion conditions and the user dictionary and outputs the update information to the updater 136. The update information includes, for example, the calculated appearance rates of the respective classified keyphrases and obsolete keyphrases included in the content for which the deletion information was transmitted.
Upon determining that the predetermined number of days has not passed from the last update date, the second extractor 135 deletes, from the deletion-candidate storage section 124, the identifier of the deletion candidate content to be determined and releases the setting of the deletion candidate content. The second extractor 135 outputs, to the updater 136, the update information including the trend keyphrases of the to-be-evaluated content for which the setting for the deletion candidate content was released.
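The decision made by the second extractor 135 can be sketched as below. The direction of each comparison is an assumption: the sketch treats the obsolete threshold as a lower bound and the other three thresholds as upper bounds, and treats the number of elapsed days as reached when it equals the threshold. The threshold values other than the 30 days given as an example in the text are also hypothetical.

```python
from datetime import date


def satisfies_deletion_conditions(rates, conditions):
    """Compare appearance rates with the thresholds (directions assumed)."""
    return (
        rates["user_dictionary"] <= conditions["user_dictionary"]
        and rates["obsolete"] >= conditions["obsolete"]
        and rates["universal"] <= conditions["universal"]
        and rates["trend"] <= conditions["trend"]
    )


def decide(rates, last_update, today, conditions):
    """Return "maintain", "delete", or "release" for one piece of content."""
    if not satisfies_deletion_conditions(rates, conditions):
        return "maintain"  # extracted as content to be maintained
    elapsed_days = (today - last_update).days
    if elapsed_days >= conditions["days_since_update"]:
        return "delete"  # deletion information is sent to the web server
    return "release"  # recently updated: deletion candidate setting released


# Hypothetical thresholds; only the 30 days comes from the description.
conditions = {
    "user_dictionary": 0.0,
    "obsolete": 50.0,
    "universal": 10.0,
    "trend": 10.0,
    "days_since_update": 30,
}
candidate_rates = {
    "user_dictionary": 0.0,
    "obsolete": 60.0,
    "universal": 10.0,
    "trend": 5.0,
}
old = decide(candidate_rates, date(2017, 1, 1), date(2017, 6, 19), conditions)
recent = decide(candidate_rates, date(2017, 6, 10), date(2017, 6, 19), conditions)
```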
The second extractor 135 determines whether or not un-evaluated content exists in a site to which the to-be-evaluated content belongs. Upon determining that un-evaluated content exists in the site to which the to-be-evaluated content belongs, the second extractor 135 designates next to-be-evaluated content and outputs, to the obtainment unit 131, an instruction for obtaining the designated to-be-evaluated content from the corresponding web server 10.
Upon determining that un-evaluated content does not exist in the site to which the to-be-evaluated content belongs, the second extractor 135 outputs, to the updater 136, an update instruction for executing processing for updating the deletion conditions and the user dictionary.
In other words, the second extractor 135 extracts content including a main phrase that appears in all of the groups. When the to-be-evaluated content includes a main phrase that appears in all of the groups, the second extractor 135 extracts the to-be-evaluated content. The second extractor 135 also extracts content, based on the appearance frequencies of the first main phrases (obsolete keyphrases), the second main phrases (universal keyphrases), and the third main phrases (trend keyphrases). Further, by referring to the user-dictionary storage section 123 in which pre-set fifth main phrases (user dictionary keyphrases) are stored, the second extractor 135 extracts content, based on the appearance frequencies of the first main phrases, the second main phrases, the third main phrases, and the fifth main phrases. The second extractor 135 also issues, to a source (the web server 10) from which the reference counts of the pieces of content were obtained, an instruction for deleting to-be-evaluated content that was not extracted and that satisfies a predetermined condition.
The update information for each piece of to-be-evaluated content is input to the updater 136 from the second extractor 135. Upon input of the update instruction from the second extractor 135, the updater 136 executes update processing. Based on the input update information of the to-be-evaluated content, the updater 136 determines whether or not there is deleted content.
Upon determining that there is deleted content, the updater 136 updates the deletion conditions in the condition storage section 125 and the user dictionary in the user-dictionary storage section 123, based on the update information. That is, the updater 136 updates the deletion conditions in the condition storage section 125, based on the appearance rates of the respective classified keyphrases included in the update information for the content for which the deletion information was transmitted. The updater 136 also deletes, from the user-dictionary storage section 123, a keyphrase that matches an obsolete keyphrase included in the content for which the deletion information was transmitted.
Upon determining that there is no deleted content, the updater 136 updates the user dictionary in the user-dictionary storage section 123, based on the update information. That is, the updater 136 adds, to the user-dictionary storage section 123, trend keyphrases included in the update information for the to-be-evaluated content extracted as being content to be maintained. Upon completing the processing on all the input update information, the updater 136 ends the update processing.
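The update of the user dictionary can be sketched as follows. The layout of each piece of update information (a dictionary with a `deleted` flag and lists of obsolete and trend keyphrases) and the function name are assumptions made for illustration.

```python
def update_user_dictionary(user_dictionary, update_infos):
    """Return the user dictionary after applying all update information.

    For deleted content, obsolete keyphrases are removed from the
    dictionary. For content to be maintained, trend keyphrases are added.
    """
    updated = set(user_dictionary)
    for info in update_infos:
        if info["deleted"]:
            updated -= set(info.get("obsolete", []))
        else:
            updated |= set(info.get("trend", []))
    return updated


# Hypothetical update information for two pieces of content.
new_dictionary = update_user_dictionary(
    {"warranty", "floppy disk"},
    [
        {"deleted": True, "obsolete": ["floppy disk", "FM-8"]},
        {"deleted": False, "trend": ["cloud"]},
    ],
)
```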
Now, update of the deletion conditions will be described with reference to
Table 23 illustrates evaluation results of, for example, five pieces of content “A-1.html” to “A-5.html” in site A. The extraction device 100 extracts content to be deleted, by comparing the evaluation results in Table 23 with the deletion conditions and the update-related condition in Table 22. Table 24 illustrates extracted pieces of content to be deleted, and “A-1.html” and “A-3.html” are pieces of content to be deleted.
The updater 136 updates the deletion conditions and the update-related condition in Table 22, based on the entries in Table 24. With respect to “user dictionary” in the deletion conditions, the updater 136 determines, as a new deletion condition, for example, a value obtained by multiplying the maximum value of appearance rates in the pieces of content to be deleted by a predetermined coefficient (for example, 1.2). In the example illustrated in
With respect to “universal keyphrase” in the deletion conditions, the updater 136 determines, as a new deletion condition, for example, a value obtained by multiplying the maximum value of appearance rates in the pieces of content to be deleted by a predetermined coefficient (for example, 1.2). In the example illustrated in
In other words, the updater 136 updates the appearance frequency setting values for extracting content, based on the appearance frequencies of the first main phrases, the second main phrases, the third main phrases, and the fifth main phrases in content that is included in pieces of content and that was not extracted. Also, when a first main phrase included in content that was not extracted is stored in the user-dictionary storage section 123 in which the fifth main phrases are stored, the updater 136 deletes the fifth main phrase that matches the first main phrase from the user-dictionary storage section 123 in which the fifth main phrases are stored. The updater 136 also stores a third main phrase (a trend keyphrase), included in the extracted content, in the user-dictionary storage section 123 in which the fifth main phrases are stored as a fifth main phrase (a user dictionary keyphrase) to be added.
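For the “user dictionary” and “universal keyphrase” conditions, the update rule described above takes the maximum appearance rate among the pieces of content to be deleted and multiplies it by a predetermined coefficient (for example, 1.2). The sketch below covers only these two conditions; the function name, the data layout, and the sample rates are assumptions.

```python
def update_upper_thresholds(
    conditions,
    deleted_rates,
    coefficient=1.2,
    fields=("user_dictionary", "universal"),
):
    """Return new conditions with the given fields set to max rate * coefficient.

    deleted_rates holds one dictionary of appearance rates per piece of
    content to be deleted. Fields not listed are left unchanged.
    """
    updated = dict(conditions)
    if not deleted_rates:
        return updated
    for field in fields:
        updated[field] = max(r[field] for r in deleted_rates) * coefficient
    return updated


# Hypothetical appearance rates of two pieces of content to be deleted.
new_conditions = update_upper_thresholds(
    {"user_dictionary": 0.0, "obsolete": 50.0, "universal": 10.0, "trend": 10.0},
    [
        {"user_dictionary": 0.0, "universal": 10.0},
        {"user_dictionary": 5.0, "universal": 20.0},
    ],
)
```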
Next, a description will be given of the operation of the extraction device 100 in the embodiment.
For example, when an administrator of the web server 10 gives an instruction for evaluating each piece of content, the obtainment unit 131 in the extraction device 100 sets to-be-evaluated content and keyphrase extraction source content (step S1). The obtainment unit 131 obtains the to-be-evaluated content and the extraction source content from the corresponding web server 10. The obtainment unit 131 also obtains an access log of the set extraction source content from the corresponding web server 10 (step S2). The obtainment unit 131 outputs the obtained extraction source content and the obtained access log of the extraction source content to the first classifier 132. The obtainment unit 131 also outputs the obtained to-be-evaluated content to the first extractor 133.
Based on the obtained access log, the first classifier 132 classifies the keyphrase extraction source content into the first group and the second group (step S3). The first classifier 132 outputs the extraction source content to the first extractor 133 in conjunction with classification information regarding the classified groups.
Upon input of the extraction source content and the classification information from the first classifier 132, the first extractor 133 extracts keyphrases for each of the classified groups (step S4). The first extractor 133 outputs the extracted keyphrases for each of the groups to the second classifier 134.
Upon input of the keyphrases for each of the groups from the first extractor 133, the second classifier 134 classifies the keyphrases into obsolete keyphrases, universal keyphrases, and trend keyphrases (step S5). The second classifier 134 stores the classified keyphrases in the keyphrase storage section 121.
Upon input of the to-be-evaluated content from the obtainment unit 131, the first extractor 133 extracts keyphrases from the to-be-evaluated content (step S6). The first extractor 133 outputs the extracted keyphrases of the to-be-evaluated content to the second classifier 134. The first extractor 133 also refers to a timestamp of the to-be-evaluated content to obtain the last update date and time of the to-be-evaluated content. The first extractor 133 outputs the obtained last update date and time to the second extractor 135.
Upon input of the keyphrases of the to-be-evaluated content from the first extractor 133, the second classifier 134 refers to the keyphrase storage section 121 and the user-dictionary storage section 123 to classify the input keyphrases. That is, the second classifier 134 classifies the keyphrases extracted from the to-be-evaluated content into user dictionary keyphrases, obsolete keyphrases, universal keyphrases, trend keyphrases, and undefined keyphrases (step S7).
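Step S7 can be sketched as a lookup of each keyphrase of the to-be-evaluated content against the user dictionary and the classified group keyphrases. The order of the lookups (user dictionary first) is an assumption, as are the function name, the label strings, and the sample phrases.

```python
def classify_evaluated(phrases, user_dictionary, classified):
    """Label each keyphrase of the to-be-evaluated content with one category."""
    labels = {}
    for phrase in phrases:
        if phrase in user_dictionary:
            # Assumption: a user dictionary match takes precedence.
            labels[phrase] = "user_dictionary"
            continue
        for category in ("obsolete", "universal", "trend"):
            if phrase in classified[category]:
                labels[phrase] = category
                break
        else:
            # No match in any stored category.
            labels[phrase] = "undefined"
    return labels


# Hypothetical stored keyphrases and user dictionary.
classified = {
    "obsolete": {"floppy disk"},
    "universal": {"backup"},
    "trend": {"cloud storage"},
}
user_dictionary = {"support policy"}

labels = classify_evaluated(
    ["backup", "floppy disk", "support policy", "tape robot"],
    user_dictionary,
    classified,
)
```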
The second classifier 134 executes undefined keyphrase processing (step S8). Now, the undefined keyphrase processing will be described with reference to
The second classifier 134 determines whether or not an undefined keyphrase is included in the classified keyphrases (step S81). When an undefined keyphrase is not included (Negative in step S81), the second classifier 134 ends the undefined keyphrase processing and returns to the original processing. When an undefined keyphrase exists (Affirmative in step S81), the second classifier 134 refers to the undefined-keyphrase storage section 122 to check whether or not each undefined keyphrase has appeared in the past (step S82).
The second classifier 134 determines whether or not the checked undefined keyphrase has occurred in the past (step S83). When the checked undefined keyphrase has occurred in the past (Affirmative in step S83), the second classifier 134 classifies the undefined keyphrase into the obsolete keyphrases (step S84). When the checked undefined keyphrase has not occurred in the past (Negative in step S83), the second classifier 134 stores the undefined keyphrase in the undefined-keyphrase storage section 122 (step S85). When this processing has been completed for all of the undefined keyphrases, the second classifier 134 ends the undefined keyphrase processing, and the process returns to the original processing. Thus, the second classifier 134 can avoid repeatedly classifying the same keyphrase into the undefined keyphrases.
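The undefined keyphrase processing of steps S81 to S85 can be sketched as follows. The set `seen_undefined` stands in for the undefined-keyphrase storage section 122 and is assumed to persist between evaluations; the function name, label strings, and sample phrase are hypothetical.

```python
def process_undefined(labels, seen_undefined):
    """Reclassify undefined keyphrases seen before; remember the new ones."""
    for phrase, label in list(labels.items()):
        if label != "undefined":
            continue
        if phrase in seen_undefined:
            labels[phrase] = "obsolete"   # step S84: seen in the past
        else:
            seen_undefined.add(phrase)    # step S85: store for next time
    return labels


seen_undefined = set()

# First evaluation: the phrase is new, so it stays undefined and is stored.
first_run = process_undefined({"tape robot": "undefined"}, seen_undefined)

# Next evaluation: the same phrase is now classified as obsolete.
second_run = process_undefined({"tape robot": "undefined"}, seen_undefined)
```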
Referring to
Upon input of the classified keyphrases of the to-be-evaluated content from the second classifier 134, the second extractor 135 evaluates the to-be-evaluated content (step S9). That is, the second extractor 135 calculates the appearance rate of user dictionary keyphrases, the appearance rate of obsolete keyphrases, the appearance rate of universal keyphrases, and the appearance rate of trend keyphrases of all the keyphrases extracted from the to-be-evaluated content.
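The appearance rates of step S9 can be sketched as the share of each category among all keyphrases extracted from the to-be-evaluated content. The function name, label strings, and sample phrases are hypothetical; the sample is chosen so that the rates match those given for the content D-1 in the specific example.

```python
from collections import Counter

CATEGORIES = ("user_dictionary", "obsolete", "universal", "trend")


def appearance_rates(labels):
    """Return, per category, its fraction of all extracted keyphrases."""
    total = len(labels)
    counts = Counter(labels.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Hypothetical classification of four keyphrases.
labels = {
    "phrase-a": "obsolete",
    "phrase-b": "universal",
    "phrase-c": "universal",
    "phrase-d": "universal",
}
rates = appearance_rates(labels)
```

Undefined keyphrases count toward the denominator but have no rate of their own in this sketch, which is why the four rates need not sum to 1.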
By referring to the condition storage section 125, the second extractor 135 determines whether or not the to-be-evaluated content satisfies the deletion conditions, based on the calculated appearance rates of the respective classified keyphrases (step S10). When the to-be-evaluated content does not satisfy the deletion conditions (Negative in step S10), the second extractor 135 extracts the to-be-evaluated content as being content to be maintained. The second extractor 135 also generates update information including the trend keyphrases of the to-be-evaluated content and outputs the update information to the updater 136. Thereafter, the process proceeds to step S13.
When the to-be-evaluated content satisfies the deletion conditions (Affirmative in step S10), the second extractor 135 sets the to-be-evaluated content as deletion candidate content (step S11) and stores the identifier of the set deletion candidate content in the deletion-candidate storage section 124.
The second extractor 135 executes deletion processing (step S12). Now, the deletion processing will be described with reference to
Based on the last update date and time of the to-be-evaluated content input from the first extractor 133 and the update-related condition in the condition storage section 125, the second extractor 135 determines whether or not a predetermined number of days has elapsed from the last update date (step S121). Upon determining that the predetermined number of days has passed from the last update date (Affirmative in step S121), the second extractor 135 refers to the deletion-candidate storage section 124 to generate deletion information based on the identifier of the deletion candidate content. The second extractor 135 transmits the generated deletion information to the corresponding web server 10 to cause the deletion candidate content to be deleted (step S122). Thereafter, the process proceeds to step S123.
Upon determining that the predetermined number of days has not passed from the last update date (Negative in step S121), the second extractor 135 deletes the identifier of the deletion candidate content to be determined from the deletion-candidate storage section 124 and releases the setting of the deletion candidate content. Thereafter, the process proceeds to step S123.
Upon transmitting the deletion information, the second extractor 135 generates update information for updating the deletion conditions and the user dictionary and outputs the update information to the updater 136. Also, upon releasing the setting of the deletion candidate content, the second extractor 135 generates update information including the trend keyphrases of the to-be-evaluated content for which the setting of the deletion candidate content was released and outputs the update information to the updater 136 (step S123). Upon outputting the update information to the updater 136, the second extractor 135 ends the deletion processing, and then the process returns to the original processing. Thus, the second extractor 135 can control deletion of the deletion candidate content in accordance with the number of days elapsed from the last update date.
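The deletion processing of steps S121 to S123 can be sketched as a check of the time elapsed since the last update. The set `deletion_candidates` stands in for the deletion-candidate storage section 124; the function name, the returned dictionary, the 365-day threshold, and the dates are hypothetical, and the actual transmission of deletion information to the web server 10 is omitted.

```python
from datetime import datetime, timedelta


def deletion_processing(content_id, last_updated, now, min_days, deletion_candidates):
    """Decide whether a deletion candidate is deleted or released."""
    if now - last_updated >= timedelta(days=min_days):
        # Step S122: enough days have elapsed, so request deletion.
        return {"action": "delete", "content_id": content_id}
    # Negative in step S121: release the deletion candidate setting.
    deletion_candidates.discard(content_id)
    return {"action": "release", "content_id": content_id}


candidates = {"D-1", "E-1"}
now = datetime(2017, 6, 19)

old = deletion_processing("D-1", datetime(2016, 1, 1), now, 365, candidates)
recent = deletion_processing("E-1", datetime(2017, 6, 1), now, 365, candidates)
```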
Referring back to
Upon input of the update instruction from the second extractor 135, the updater 136 executes update processing (step S15). Now, the update processing will be described with reference to
Based on the update information of to-be-evaluated content, the updater 136 determines whether or not there is deleted content (step S151). Upon determining that there is deleted content (Affirmative in step S151), the updater 136 updates the deletion conditions in the condition storage section 125 and the user dictionary in the user-dictionary storage section 123, based on the update information (step S152).
Upon determining that there is no deleted content (Negative in step S151), the updater 136 updates the user dictionary in the user-dictionary storage section 123, based on the update information (step S153). Upon completing the processing on all the input update information, the updater 136 ends the update processing, and the process returns to the original processing. Thus, the updater 136 can update the deletion conditions and the user dictionary in accordance with whether or not there is deleted content.
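The user dictionary part of the update processing (steps S151 to S153) can be sketched as follows. The layout of the update information is an assumption, as is the function name. How the deletion conditions themselves are recalculated is not sketched; the function only reports whether such an update would be triggered.

```python
def update_processing(update_info, user_dictionary):
    """Apply update information to the user dictionary (a set, mutated in place).

    Returns True when deleted content exists, that is, when the deletion
    conditions would also be updated (step S152).
    """
    conditions_need_update = False
    for info in update_info:
        if info["deleted"]:
            # Remove obsolete keyphrases of deleted content from the dictionary.
            user_dictionary.difference_update(info["obsolete"])
            conditions_need_update = True
        else:
            # Register trend keyphrases of maintained content.
            user_dictionary.update(info["trend"])
    return conditions_need_update


# Hypothetical dictionary and update information.
user_dictionary = {"support policy", "floppy disk"}
update_info = [
    {"deleted": True, "obsolete": {"floppy disk"}, "trend": set()},
    {"deleted": False, "obsolete": set(), "trend": {"cloud storage"}},
]
needs_update = update_processing(update_info, user_dictionary)
```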
Referring to
Next, a specific example in which pieces of content are evaluated and content to be maintained is extracted will be described with reference to
The extraction device 100 first sets content D-1 as to-be-evaluated content and sets the pieces of content E-1 to J-1 as keyphrase extraction source content. The extraction device 100 obtains the to-be-evaluated content D-1 and the extraction source content E-1 to J-1 from the corresponding web server 10. The extraction device 100 also obtains an access log (including the numbers of accesses) of the set extraction source content E-1 to J-1 from the web server 10.
Based on the numbers of accesses, the extraction device 100 classifies the pieces of content E-1 to G-1 into the first group in which the number of accesses is small. The extraction device 100 also classifies the pieces of content H-1 to J-1 into the second group in which the number of accesses is large. The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into obsolete keyphrases, universal keyphrases, and trend keyphrases. In
Next, the extraction device 100 sets the content E-1 as to-be-evaluated content and sets the pieces of content D-1 and F-1 to J-1 as keyphrase extraction source content. The content and the access log may be obtained again as in the case of the content D-1, or the content and the access log already obtained in the case of the content D-1 may be reused. A description of how the content and the access log are obtained is therefore omitted in the description of the pieces of content F-1 to J-1.
The extraction device 100 classifies the pieces of content D-1, F-1, and G-1 into the first group (in which the number of accesses is small). The extraction device 100 classifies the pieces of content H-1 to J-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content E-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
Next, the extraction device 100 sets the content F-1 as to-be-evaluated content and sets the pieces of content D-1, E-1, and G-1 to J-1 as keyphrase extraction source content.
The extraction device 100 classifies the pieces of content D-1, E-1, and G-1 into the first group (in which the number of accesses is small). The extraction device 100 classifies the pieces of content H-1 to J-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content F-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
Next, the extraction device 100 sets the content G-1 as to-be-evaluated content and sets the pieces of content D-1 to F-1 and H-1 to J-1 as keyphrase extraction source content.
The extraction device 100 classifies the pieces of content D-1 to F-1 into the first group (in which the number of accesses is small). Also, the extraction device 100 classifies the pieces of content H-1 to J-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content G-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
Next, the extraction device 100 sets the content H-1 as to-be-evaluated content and sets the pieces of content D-1 to G-1, I-1, and J-1 as keyphrase extraction source content.
The extraction device 100 classifies the pieces of content D-1 to F-1 into the first group (in which the number of accesses is small). The extraction device 100 classifies the pieces of content G-1, I-1, and J-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content H-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
Next, the extraction device 100 sets the content I-1 as to-be-evaluated content and sets the pieces of content D-1 to H-1 and J-1 as keyphrase extraction source content.
The extraction device 100 classifies the pieces of content D-1 to F-1 into the first group (in which the number of accesses is small). Also, the extraction device 100 classifies the pieces of content G-1, H-1, and J-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content I-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
Next, the extraction device 100 sets the content J-1 as to-be-evaluated content and sets the pieces of content D-1 to I-1 as keyphrase extraction source content.
The extraction device 100 classifies the pieces of content D-1 to F-1 into the first group (in which the number of accesses is small). Also, the extraction device 100 classifies the pieces of content G-1 to I-1 into the second group (in which the number of accesses is large). The extraction device 100 extracts keyphrases for each of the first group and the second group.
The extraction device 100 classifies the keyphrases in the first group and the second group into the obsolete keyphrases, the universal keyphrases, and the trend keyphrases. The extraction device 100 classifies each keyphrase extracted from the content J-1, which is to-be-evaluated content, into one of the user dictionary keyphrases, the obsolete keyphrases, the universal keyphrases, the trend keyphrases, and the undefined keyphrases. That is, the extraction device 100 evaluates the content. In the example illustrated in
For the content D-1, the appearance frequency of obsolete keyphrases is “0.25”, the appearance frequency of universal keyphrases is “0.75”, the appearance frequency of trend keyphrases is “0”, and the appearance frequency of user dictionary keyphrases is “0”. Accordingly, the content D-1 satisfies the deletion conditions and is thus content to be deleted.
For the content E-1, the appearance frequency of obsolete keyphrases is “0.25”, the appearance frequency of universal keyphrases is “0.75”, the appearance frequency of trend keyphrases is “0”, and the appearance frequency of user dictionary keyphrases is “0”. Accordingly, the content E-1 satisfies the deletion conditions and is thus content to be deleted.
For the content F-1, the appearance frequency of obsolete keyphrases is “0.25”, the appearance frequency of universal keyphrases is “0.25”, the appearance frequency of trend keyphrases is “0”, and the appearance frequency of user dictionary keyphrases is “0.25”. Accordingly, the content F-1 does not satisfy the deletion conditions and is thus content to be maintained.
For the content G-1, the appearance frequency of obsolete keyphrases is “0.2”, the appearance frequency of universal keyphrases is “0”, the appearance frequency of trend keyphrases is “0.2”, and the appearance frequency of user dictionary keyphrases is “0.2”. Accordingly, the content G-1 does not satisfy the deletion conditions and is thus content to be maintained.
For the content H-1, the appearance frequency of obsolete keyphrases is “0”, the appearance frequency of universal keyphrases is “0.5”, the appearance frequency of trend keyphrases is “0.17”, and the appearance frequency of user dictionary keyphrases is “0”. Accordingly, the content H-1 does not satisfy the deletion conditions and is thus content to be maintained.
For the content I-1, the appearance frequency of obsolete keyphrases is “0”, the appearance frequency of universal keyphrases is “0.75”, the appearance frequency of trend keyphrases is “0”, and the appearance frequency of user dictionary keyphrases is “0”. Accordingly, the content I-1 does not satisfy the deletion conditions and is thus content to be maintained.
For the content J-1, the appearance frequency of obsolete keyphrases is “0”, the appearance frequency of universal keyphrases is “0.75”, the appearance frequency of trend keyphrases is “0”, and the appearance frequency of user dictionary keyphrases is “0”. Accordingly, the content J-1 does not satisfy the deletion conditions and is thus content to be maintained.
As described above, the evaluation results for the pieces of content D-1 to J-1 are that the pieces of content D-1 and E-1 are content to be deleted and the pieces of content F-1 to J-1 are content to be maintained. For example, although the number of accesses to the content F-1 to be maintained is “5”, which is the same as the number of accesses to the content E-1 to be deleted, the content F-1 does not satisfy the deletion conditions and is thus to be maintained, since it includes a user dictionary keyphrase. That is, the extraction device 100 can extract content that is likely to be referred to in the future, even if the reference count of the content is small. In the extraction device 100, a small number of accesses does not by itself constitute a deletion condition; instead, evaluation is performed through comparison with content in other sites. Thus, content with a small number of accesses is not simply deleted.
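The evaluation results above can be reproduced with a simple threshold check. The deletion conditions actually held in the condition storage section 125 are not restated in this passage, so the rule below (an obsolete rate of at least 0.2 together with a trend rate and a user dictionary rate of 0) is purely a hypothetical rule chosen to agree with the seven results; the function and parameter names are likewise assumptions. The rates are those given above for the content D-1 to J-1.

```python
def satisfies_deletion(rates, min_obsolete=0.2, max_trend=0.0, max_user_dictionary=0.0):
    """Hypothetical deletion rule; the thresholds are illustrative only."""
    return (
        rates["obsolete"] >= min_obsolete
        and rates["trend"] <= max_trend
        and rates["user_dictionary"] <= max_user_dictionary
    )


# Appearance rates of the specific example.
RATES = {
    "D-1": {"obsolete": 0.25, "universal": 0.75, "trend": 0.0, "user_dictionary": 0.0},
    "E-1": {"obsolete": 0.25, "universal": 0.75, "trend": 0.0, "user_dictionary": 0.0},
    "F-1": {"obsolete": 0.25, "universal": 0.25, "trend": 0.0, "user_dictionary": 0.25},
    "G-1": {"obsolete": 0.2, "universal": 0.0, "trend": 0.2, "user_dictionary": 0.2},
    "H-1": {"obsolete": 0.0, "universal": 0.5, "trend": 0.17, "user_dictionary": 0.0},
    "I-1": {"obsolete": 0.0, "universal": 0.75, "trend": 0.0, "user_dictionary": 0.0},
    "J-1": {"obsolete": 0.0, "universal": 0.75, "trend": 0.0, "user_dictionary": 0.0},
}

to_delete = [cid for cid, rates in RATES.items() if satisfies_deletion(rates)]
to_maintain = [cid for cid in RATES if cid not in to_delete]
```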
As described above, the extraction device 100 obtains reference counts that are the numbers of times respective pieces of content were referred to. Based on the reference counts, the extraction device 100 classifies the pieces of content into a plurality of groups. The extraction device 100 extracts main phrases of the content from each of the groups, the main phrases being based on the appearance frequencies of words included in the content. The extraction device 100 extracts the content including a main phrase that appears in all of the groups. As a result, the extraction device 100 can extract content that is likely to be referred to in the future, even if the reference count of the content is small. The extraction device 100 can also reduce the amount of search load during search for content that is likely to be referred to in the future.
Also, the extraction device 100 classifies pieces of content different from to-be-evaluated content into a plurality of groups. When the to-be-evaluated content includes a main phrase that appears in all of the groups, the extraction device 100 extracts the to-be-evaluated content. As a result, the extraction device 100 extracts more appropriate keyphrases.
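The way each piece of content is evaluated against groups formed from the other pieces of content can be sketched as a leave-one-out loop. Splitting the ranked extraction source content in half is an assumption that appears consistent with the groupings in the specific example; the function names are hypothetical, and all access counts other than the count of 5 for the content E-1 and F-1 are hypothetical as well.

```python
def split_groups(access_counts):
    """Rank by access count and split in half (assumed grouping rule)."""
    ranked = sorted(access_counts, key=lambda cid: (access_counts[cid], cid))
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


def leave_one_out(access_counts):
    """For each target, group all other content as extraction source content."""
    plan = {}
    for target in access_counts:
        sources = {cid: n for cid, n in access_counts.items() if cid != target}
        plan[target] = split_groups(sources)
    return plan


# Hypothetical access counts (only the value 5 for E-1 and F-1 is from the example).
access_counts = {"D-1": 3, "E-1": 5, "F-1": 5, "G-1": 8, "H-1": 20, "I-1": 30, "J-1": 40}
plan = leave_one_out(access_counts)
```

Under these counts, the content G-1 falls into the first group while one of the content D-1 to F-1 is being evaluated, and into the second group while one of the content H-1 to J-1 is being evaluated, as in the specific example.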
The extraction device 100 also classifies pieces of content into a first group in which the reference count is small and a second group in which the reference count is large. This allows the extraction device 100 to extract universal keyphrases with respect to the reference counts.
The extraction device 100 also classifies the main phrases extracted from each of the groups into the first main phrases that appear in only the first group, the second main phrases that appear in both of the first and second groups, and the third main phrases that appear in only the second group. The extraction device 100 also extracts content, based on the appearance frequencies of the first main phrases, the second main phrases, and the third main phrases. This allows the extraction device 100 to extract content by using keyphrases according to the reference counts.
Also, the extraction device 100 stores, in the undefined-keyphrase storage section 122, a fourth main phrase that is included in the main phrases extracted from the to-be-evaluated content and that is a main phrase not corresponding to any of the first main phrases, the second main phrases, and the third main phrases. During next content extraction, when a fourth main phrase extracted from the to-be-evaluated content matches any of the fourth main phrases stored in the undefined-keyphrase storage section 122, the extraction device 100 classifies the extracted fourth main phrase into the first main phrases. This allows the extraction device 100 to classify a keyphrase that appears in only the to-be-evaluated content into the obsolete keyphrases.
Also, by referring to the user-dictionary storage section 123 in which pre-set fifth main phrases are stored and based on the appearance frequencies of the first main phrases, the second main phrases, the third main phrases, and the fifth main phrases, the extraction device 100 extracts content. This allows the extraction device 100 to inhibit mistakenly deleting content, by designating a keyphrase included in content desired to be maintained.
Also, the extraction device 100 updates the appearance frequency setting values for extracting content, based on the appearance frequencies of the first main phrases, the second main phrases, the third main phrases, and the fifth main phrases in content that is included in the pieces of content and that was not extracted. This allows the extraction device 100 to more appropriately extract content that is desired to be maintained.
Also, when a first main phrase included in content that was not extracted is included in the user-dictionary storage section 123 in which the fifth main phrases are stored, the extraction device 100 deletes the fifth main phrase that matches the first main phrase from the user-dictionary storage section 123 in which the fifth main phrases are stored. As a result, the extraction device 100 can delete an obsolete keyphrase from the user dictionary.
The extraction device 100 also stores a third main phrase, included in the extracted content, in the user-dictionary storage section 123 in which the fifth main phrases are stored as a fifth main phrase to be added. This allows the extraction device 100 to register a trend keyphrase in the user dictionary.
The extraction device 100 also issues, to a source from which the reference counts of the pieces of content were obtained, an instruction for deleting to-be-evaluated content that was not extracted and that satisfies a predetermined condition. This allows the extraction device 100 to delete obsolete content from the web server 10.
In the above-described embodiment, when a predetermined number of days has passed from the last update date of deletion candidate content, the deletion information is transmitted to the corresponding web server 10 to delete the deletion candidate content, but the present disclosure is not limited thereto. For example, the deletion information may be transmitted to a terminal apparatus (not illustrated) used by the administrator of the corresponding web server 10, and after obtaining approval from the administrator, the web server 10 may delete the deletion candidate content.
Although, in the above-described embodiment, all content in a site of interest is evaluated, the present disclosure is not limited thereto. For example, if subordinate content linked from certain content does not have a link from other superordinate content, the subordinate content may be deleted together with content in a source of the link.
Also, although, in the above-described embodiment, keyphrase extraction source content is classified into the two groups, the present disclosure is not limited thereto. For example, keyphrase extraction source content may be classified into three or more groups in accordance with the number of accesses to the content.
Although, in the above-described embodiment, the number of accesses (reference count) is obtained based on the access log of each piece of content in the web server 10, the present disclosure is not limited thereto. For example, an access counter may be provided for each piece of content to aggregate the number of accesses.
The constituent elements of the illustrated units and portions may or may not be physically configured as illustrated. That is, specific forms of distribution/integration of the units and portions are not limited to those illustrated, and all or some thereof may be functionally or physically distributed or integrated in an arbitrary manner, depending on various loads, usage states, and so on. For example, the second extractor 135 may be configured as a functional unit from which the deletion processing is separated. The illustrated processes are not limited to the above-described order. For example, the processes may be performed at the same time, or the order of the processes may be interchanged for execution, as long as such a change does not cause contradiction in details of processing.
In addition, all or any of the processing functions of each apparatus may also be executed by a CPU (or a microcomputer, such as an MPU or a micro controller unit (MCU)). Needless to say, all or any of the processing functions may also be executed on a program analyzed and executed by a CPU (or a microcomputer, such as an MPU or MCU) or on wired-logic-based hardware.
The various types of processing described in the above embodiment may be realized by executing a prepared program with a computer. Accordingly, a description below will be given of an example of a computer that executes a program having functions that are analogous to those in the above-described embodiment.
As illustrated in
An extraction program having functions that are the same as or similar to those of the processing units, that is, the obtainment unit 131, the first classifier 132, the first extractor 133, the second classifier 134, the second extractor 135, and the updater 136, illustrated in
The CPU 201 reads programs stored in the hard-disk device 208, loads the programs into the RAM 207, and executes the programs to thereby perform various types of processing. These programs also allow the computer 200 to function as the obtainment unit 131, the first classifier 132, the first extractor 133, the second classifier 134, the second extractor 135, and the updater 136 illustrated in
The above-described extraction program may or may not be stored in the hard-disk device 208. For example, the computer 200 may read and execute the program stored on/in a storage medium that is readable by the computer 200. Examples of the storage medium that is readable by the computer 200 include portable recording media, such as a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), and a Universal Serial Bus (USB) memory, a semiconductor memory, such as a flash memory, and a hard-disk drive. The extraction program may also be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 200 may read the extraction program therefrom and execute it.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2017-119271 | Jun 2017 | JP | national |