This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2015-040783, filed on Mar. 2, 2015, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a recording medium having stored therein a data management program, a data management device, and a data management method.
A data storage system stores a large amount of data in a storage such as a disk. A low-speed storage such as a disk has a low throughput per unit time (a high cost), and therefore a cache technology is used.
The cache technology is a technology for reducing processing time by using a memory when a controller that is high in processing speed reads data from a low-speed storage. When the controller reads data from the low-speed storage, the read data is temporarily stored in a memory, so that from the next time the data can be read from the memory, which is capable of higher-speed reading/writing than the low-speed storage.
However, when a large amount of data that exceeds a capacity of a memory is processed, access to a disk frequently occurs, and consequently performance of data processing greatly deteriorates.
Accordingly, as one of cache technologies, a technology for collecting mutually relevant data in the same segment in accordance with an access history so as to perform data rearrangement (hereinafter referred to as a “data rearrangement technology”) has been proposed (for example, Patent Document 1).
Patent Document 1: International Publication Pamphlet No. WO 2013/114538
Patent Document 2: Japanese Laid-open Patent Publication No. 7-200389
Patent Document 3: Japanese Laid-open Patent Publication No. 2014-142749
Patent Document 4: Japanese Patent No. 5413867
According to one aspect, a non-transitory computer-readable recording medium having stored therein a data management program causes a computer to execute the process described below. The computer monitors a relevance ratio between pieces of data based on a frequency of access to a pair for each of the pairs of the pieces of data consecutively accessed in response to a request for access to a storage device storing a plurality of pieces of data. The computer determines whether the pair is a pair having a relevance ratio representing a specified tendency, on the basis of tendencies of the monitored relevance ratios of the pairs. The computer groups the plurality of pieces of data according to a result of the determining and the relevance ratio, and specifies data to be arranged in each group.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The data pair refers to two pieces of data that are consecutively accessed. Data accessed currently and data accessed just before form a pair, and a frequency at which the pair appears is recorded.
As illustrated in
When relevance between pieces of data is represented by using a graph, the pieces of data A, B, C, D, and E have a structure illustrated in
In order to arrange these pieces of data into two segments, these pieces of data are divided into a group including the pieces of data A, B, and C, and a group including the pieces of data D and E, as illustrated in
As described above, relevant pieces of data are collected in the same segment in accordance with the intensity of relevance of a data pair that is accumulated during a prescribed period so as to perform data rearrangement.
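The grouping step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `group_by_relevance`, the `capacity` limit per segment, the greedy merge strategy, and the access counts in the usage note are all hypothetical, and the text does not prescribe a particular partitioning algorithm.

```python
from typing import Dict, List, Set, Tuple


def group_by_relevance(
    relevance: Dict[Tuple[str, str], int], capacity: int
) -> List[Set[str]]:
    """Greedily merge the most strongly related pairs into segments.

    `relevance` maps a data pair to its accumulated access count.
    `capacity` is a hypothetical upper bound on items per segment.
    """
    # Start with every piece of data in its own group.
    group_of: Dict[str, Set[str]] = {}
    for a, b in relevance:
        group_of.setdefault(a, {a})
        group_of.setdefault(b, {b})

    # Visit pairs from the strongest relevance to the weakest.
    for (a, b), _count in sorted(relevance.items(), key=lambda kv: -kv[1]):
        ga, gb = group_of[a], group_of[b]
        if ga is gb or len(ga) + len(gb) > capacity:
            continue  # already together, or the merged segment would overflow
        ga |= gb
        for item in gb:
            group_of[item] = ga

    # Deduplicate the group objects while keeping a stable order.
    seen, groups = set(), []
    for g in group_of.values():
        if id(g) not in seen:
            seen.add(id(g))
            groups.append(g)
    return groups
```

With hypothetical counts such as `{("A", "B"): 5, ("B", "C"): 4, ("C", "D"): 1, ("D", "E"): 6}` and a capacity of 3, the weakly related pair C-D is the one that ends up split across segments, which matches the division into a group of A, B, and C and a group of D and E.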
It is difficult to continue to accumulate all access histories and all pieces of relevance information as described above, and therefore histories during a prescribed period are recorded. As an example, a history of access to data on a cache is recorded while the data is stored on the cache. In this case, the intensity of relevance accumulated during a prescribed period is watched.
By using the above data rearrangement technology, when a tendency of an access history does not change, data arrangement having a high access efficiency is realized.
However, the tendency of the access history is not always stationary. When there is a data pair that greatly changes in relevance, there are the following concerns. When the tendency of the access history changes, data rearrangement is performed according to a change in the tendency. However, when there is a data pair whose relevance ratio changes more frequently than the tendency of the access history as a whole changes, data rearrangement is performed more frequently than needed, and inefficient tasks are performed.
When relevance greatly changes in the middle of an accumulation period during which relevance information of data is accumulated, there are the following concerns. As an example, when data arrangement is determined without considering that pieces of data forming a data pair become irrelevant to each other, data arrangement based on relevance that no longer exists, namely, inefficient data arrangement is performed.
In one aspect of the invention, a technology is provided that enables data arrangement having a high reading efficiency according to a change in a tendency of a data access situation.
The problems above are described in further detail. A case in which there is a data pair that greatly changes in relevance is described first.
In a case in which relevance between pieces of data changes by time t1 in such a way that a relevance ratio between the pieces of data A and B decreases, and a relevance ratio between the pieces of data C and D increases, rearrangement is performed, and the pieces of data A, C, and D are arranged in one segment, and the data B is arranged in the other segment.
In the data rearrangement technology, relevance information prior to time t1 is not accumulated in order to prevent resources from being wasted, and therefore a change in relevance between pieces of data fails to be watched from time t0 to time t1. However, in the conventional data rearrangement technology, an accumulated value of the number of accesses to relevant data is stored.
Accordingly, unlike
Actually, the pieces of data C and D have a high relevance. Therefore, when the data C is accessed, it is highly probable that the data D is also accessed. However, the pieces of data C and D are not arranged in the same segment. Accordingly, it is highly probable that one of the pieces of data C and D does not exist in a memory, and another access to a disk is needed.
As described above, when data arrangement is determined without considering that relevance of a data pair no longer exists, data arrangement is performed according to relevance that no longer exists, and an effect of improvements in reading efficiency due to rearrangement is not exhibited in some cases.
A case in which relevance greatly changes in the middle of an accumulation period of relevance information is described next.
In
When rearrangement is performed according to a change in the relevance ratio, pieces of data are frequently replaced. Therefore, even when rearrangement is performed, an effect of improvements in a reading efficiency due to rearrangement is lost soon, and this results in a decrease in data throughput. Accordingly, as illustrated in
As illustrated in
Thus, when optimum data arrangement is determined, it is preferable that valid information and invalid information for rearrangement be distinguished from each other. The purpose of this is, for example, to exclude invalid information from a target to be rearranged.
Accordingly, the relevance ratio for each of the data pairs can be recorded as time-series information without recording the relevance ratio by using an accumulated value.
However, in the data rearrangement technology, a case in which data pairs have respective different tendencies of the relevance ratio is not considered, and relevance information of each of the data pairs is handled equivalently. Therefore, influence of invalid relevance information fails to be excluded.
Accordingly, in the embodiments, arrangement is not determined according to a relevance ratio every time the relevance ratio changes, but arrangement is determined while considering a tendency of the relevance ratio during a prescribed period (accumulation period). In addition, according to the tendency of the relevance ratio during a prescribed period, valid relevance information and invalid relevance information for the determination of arrangement are distinguished from each other.
The monitor 2 intermittently monitors a relevance ratio between pieces of data based on an access frequency to a data pair for each pair of pieces of data that are consecutively accessed in response to a request for access to a storage device storing a plurality of pieces of data. An example of the monitor 2 is the relevance extracting unit 22 described later.
The determining unit 3 determines whether a pair is a pair having a relevance ratio representing a specified tendency on the basis of tendencies of the intermittently monitored relevance ratios of the pairs. An example of the determining unit 3 is the statistical processing unit 23 described later.
The specifying unit 4 groups pieces of data in accordance with a determination result and the relevance ratio, and specifies pieces of data to be arranged in each group. An example of the specifying unit 4 is the arrangement determining unit 24 described later.
The configuration above enables data arrangement having a high reading efficiency according to a change in a tendency of a data access situation.
The monitor 2 divides a period during which a tendency of a relevance ratio is observed (an accumulation period) into a plurality of periods, and intermittently monitors a relevance ratio between pieces of data based on a frequency of access to a pair during each of the divided periods.
In the configuration above, the tendency of the relevance ratio between pieces of data forming a pair can be intermittently monitored during each of the divided periods.
The determining unit 3 calculates a mean or a standard deviation of the intermittently monitored relevance ratio during each of the divided periods for each of the pairs, and specifies a pair having a relevance ratio whereby the calculated mean or standard deviation satisfies a specified condition.
In the configuration above, invalid relevance information, such as a data pair that stationarily has a low relevance ratio, a data pair that greatly changes in the relevance ratio, or a data pair that changes in a tendency of the relevance ratio, can be specified.
The specifying unit 4 calculates a relevance ratio during an observation period for each pair other than a pair having a relevance ratio representing a specified tendency, by weighting the mean of the relevance ratio during each of the divided periods in such a way that the weight is reduced in an order that goes back to the past from the immediately previous divided period.
In the configuration above, the relevance ratio of a more recent divided period has a greater weight, and the current relevance ratio is reflected more strongly.
The specifying unit 4 groups pairs other than a pair having a relevance ratio representing a specified tendency.
In the configuration above, when data arrangement for each segment is determined according to relevance between pieces of data, invalid relevance information can be excluded.
The embodiments are described below in detail.
The server 11 includes a controller 12, a memory device (hereinafter referred to as a “memory”) 13, and a storage device (disk) 14. The controller 12 is a processor such as a central processing unit (CPU).
Examples of the storage device 14 include a disk drive such as a hard disk drive (HDD). Hereinafter, the storage device 14 is referred to as a disk 14.
The memory 13 is a storage that is accessible at a higher speed than the disk 14. Examples of the memory 13 include a RAM (Random Access Memory), a flash memory, and the like.
The server 11 includes a ROM storing a BIOS (Basic Input/Output System), a program memory, and the like, in addition to the configuration above. A program executed by the controller 12 may be obtained via the network 16, or may be obtained by a removable memory or a removable computer-readable recording medium such as a CD-ROM being mounted onto the server 11. The program executed by the controller 12 includes a program for executing a process described in the embodiments.
Next, the accumulation period T is divided into a plurality of sub-periods Tm, and each of the sub-periods Tm is further divided into a plurality of sub-sub-periods Ts. The number of consecutive accesses to each of the data pairs during each of the sub-sub-periods Ts is measured. In order to determine whether relevance information of each of the data pairs is valid, a mean value and a standard deviation of the relevance ratio are calculated from a change in the number of accesses during each of the sub-sub-periods Ts within the sub-period Tm. As described later, a final relevance ratio for a data pair having valid relevance information is calculated on the basis of a mean relevance ratio during the accumulation period T.
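The nesting of the periods can be illustrated with a small helper that maps a timestamp to its sub-period and sub-sub-period. This is a sketch under assumptions not stated in the text: the name `locate_period` is hypothetical, all sub-periods Tm are taken to have the same length, and Ts is taken to divide Tm evenly.

```python
from typing import Tuple


def locate_period(t: float, t_start: float, tm: float, ts: float) -> Tuple[int, int]:
    """Return (sub-period index, sub-sub-period index) for timestamp t.

    t_start is the start of the accumulation period T; tm and ts are the
    lengths of a sub-period Tm and a sub-sub-period Ts in the same time
    unit. Both indices are zero-based.
    """
    if t < t_start:
        raise ValueError("timestamp precedes the accumulation period")
    elapsed = t - t_start
    k = int(elapsed // tm)             # which sub-period Tm
    s = int((elapsed - k * tm) // ts)  # which Ts inside that Tm
    return k, s
```

For example, with Tm of 60 time units and Ts of 10 time units, an access 75 units after the start falls into the second sub-period and the second sub-sub-period within it, that is, indices `(1, 1)`.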
The memory 13 stores a data/segment correspondence table 32, a relevance management table 33, and relevance statistics management information 34.
The data/segment correspondence table 32 stores information indicating a correspondence relationship between data and a segment that is an arrangement destination of the data.
The relevance management table 33 stores the number of accesses to a data pair (a relevance ratio), namely, relevance information, during each of the sub-sub-periods Ts within the sub-period Tm.
The relevance statistics management information 34 includes relevance statistics information and relevance statistics (mean) information. The relevance statistics information stores information obtained by performing statistical processing on relevance information during each Tm. The relevance statistics (mean) information is information in which plural pieces of relevance statistics information relating to a mean value during the accumulation period T are collected.
The controller 12 functions as an input/output managing unit 21, a relevance extracting unit 22, a statistical processing unit 23, and an arrangement determining unit 24 by executing a program according to the embodiments.
The input/output managing unit 21 searches the memory 13 in response to a request input from a request source such as the client 15. When data specified in the request does not exist in the memory 13, the input/output managing unit 21 further searches the disk 14, and transmits the data specified in the request to the request source. The request is not always transmitted by the client 15, and another subject of a process performed in the server 11 may be a request issuer. When an input/output device is connected to the server 11, it is assumed that a user inputs a request to the input/output device.
When a request is input, the input/output managing unit 21 searches the memory 13 for data specified in the request. When the data specified in the request exists on the memory 13, the input/output managing unit 21 reads the data from the memory 13, and transmits the data to the request source.
When the data specified in the request does not exist on the memory 13, the input/output managing unit 21 searches the disk 14 for the data specified in the request. When the data specified in the request exists on the disk 14, the input/output managing unit 21 reads all pieces of data included in a segment to which the data specified in the request belongs from the disk 14, by using the data/segment correspondence table 32. The input/output managing unit 21 transmits, to the request source, the data specified in the request from among all pieces of data included in the read segment. In this case, the input/output managing unit 21 stores all pieces of data included in the read segment in the memory 13.
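The read path described above can be sketched as a toy model. The class `SegmentStore`, its attribute names, and the use of plain dictionaries for the memory 13, the disk 14, and the data/segment correspondence table 32 are illustrative assumptions; eviction from memory is deliberately left out.

```python
from typing import Dict


class SegmentStore:
    """Toy model of the memory 13 / disk 14 read path (names are illustrative)."""

    def __init__(
        self,
        disk: Dict[int, Dict[str, bytes]],
        data_to_segment: Dict[str, int],
    ) -> None:
        self.disk = disk                        # segment id -> {data key: value}
        self.data_to_segment = data_to_segment  # stands in for table 32
        self.memory: Dict[str, bytes] = {}      # stands in for memory 13
        self.disk_reads = 0                     # number of segment reads from disk

    def read(self, key: str) -> bytes:
        # 1. Serve the request from memory when possible.
        if key in self.memory:
            return self.memory[key]
        # 2. Otherwise find the segment that holds the key and read it whole.
        segment_id = self.data_to_segment[key]
        segment = self.disk[segment_id]
        self.disk_reads += 1
        # 3. Keep every piece of data from that segment in memory.
        self.memory.update(segment)
        return segment[key]
```

In this model, reading data A from a segment that also holds data B costs one disk read, and a following request for B is then served from memory. This is why placing strongly related data in the same segment reduces disk accesses.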
A case in which a process of storing all pieces of data included in the segment read from the disk 14 in the memory 13 is performed at a timing of the issuance of a request has been described above, but the storing process is not limited to this. As an example, the input/output managing unit 21 may obtain an access frequency during a prescribed period, read a segment having a high access frequency from the disk 14 with priority, and store the segment in the memory 13.
The relevance extracting unit 22 monitors a relevance ratio between pieces of data based on a frequency of access to a data pair during each of the sub-sub-periods Ts. More specifically, the relevance extracting unit 22 extracts a data pair consecutively accessed during each of the sub-sub-periods Ts from an access sequence, and increments a frequency of access to the data pair (the relevance ratio) by 1.
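The extraction of consecutively accessed pairs can be sketched as follows. Several details are assumptions made for illustration only: the function name `count_pairs_per_ts`, treating a pair as unordered (A-B equals B-A), ignoring a repeated access to the same data, and attributing a pair to the sub-sub-period of its later access.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def count_pairs_per_ts(
    accesses: Iterable[Tuple[float, str]], ts: float, n_ts: int
) -> Dict[Pair, List[int]]:
    """Count consecutively accessed pairs in each sub-sub-period Ts.

    `accesses` is a time-ordered sequence of (timestamp, data key) within
    one sub-period Tm that starts at time 0. Returns pair -> counts per Ts.
    """
    counts: Dict[Pair, List[int]] = defaultdict(lambda: [0] * n_ts)
    previous = None
    for timestamp, key in accesses:
        if previous is not None and previous != key:
            pair = tuple(sorted((previous, key)))  # assumption: A-B equals B-A
            slot = min(int(timestamp // ts), n_ts - 1)
            counts[pair][slot] += 1                # increment the frequency by 1
        previous = key
    return dict(counts)
```

Each list returned by this sketch corresponds to one row of a table such as the relevance management table 33, with one column per sub-sub-period Ts.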
The statistical processing unit 23 performs statistical processing on the relevance ratio of the pair monitored during each of the sub-sub-periods Ts, and determines whether the pair is a pair having a relevance ratio representing a specified tendency on the basis of a tendency of the relevance ratio obtained as a result of the statistical processing. More specifically, the statistical processing unit 23 calculates a statistic of relevance information for each of the sub-periods Tm by using the relevance management table 33, and invalidates invalid relevance information.
The arrangement determining unit 24 groups pieces of data according to the determination result and the relevance ratio, and specifies pieces of data to be arranged in each of the groups (segments). More specifically, the arrangement determining unit 24 specifies pieces of data to be arranged in each of the segments for each of the accumulation periods T on the basis of relevance information between pieces of data other than the invalidated relevance information. Then, the arrangement determining unit 24 clears the content of the relevance management table 33 and the content of the relevance statistics management information 34.
The relevance statistics information table 34a stores information (a mean value and a standard deviation) obtained by performing statistical processing on relevance information of a data pair during each Tm, by using the relevance management table 33. In the relevance statistics information table 34a, an invalid flag is added to the relevance information of the data pair when a result of the statistical processing satisfies a prescribed condition (for example, mean value ≦1, or standard deviation ≧1).
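The statistical processing and the invalid flag can be sketched as follows, using the example condition from the text (mean value ≦1, or standard deviation ≧1). The names `PairStats` and `summarize_sub_period` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev
from typing import Dict, List, NamedTuple, Tuple


class PairStats(NamedTuple):
    mean_value: float
    std_dev: float
    invalid: bool


def summarize_sub_period(
    counts: Dict[Tuple[str, str], List[int]],
    mean_threshold: float = 1.0,
    std_threshold: float = 1.0,
) -> Dict[Tuple[str, str], PairStats]:
    """Build the rows of a table like 34a for one sub-period Tm.

    `counts` maps a data pair to its access counts per sub-sub-period Ts.
    """
    table = {}
    for pair, per_ts in counts.items():
        m = mean(per_ts)
        s = pstdev(per_ts)  # population standard deviation (an assumption)
        # Flag pairs that are stationarily low or that fluctuate strongly.
        invalid = m <= mean_threshold or s >= std_threshold
        table[pair] = PairStats(m, s, invalid)
    return table
```

With hypothetical counts, a pair observed as `[5, 5, 5, 5]` stays valid, whereas `[0, 9, 0, 9]` is flagged for its large standard deviation and `[1, 0, 1, 0]` is flagged for its low mean.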
The relevance statistics (mean) information table 34b represents information in which a plurality of mean values during the accumulation period T are collected from the relevance statistics information table 34a. When the relevance statistics information table 34a is summarized, a mean value of a data pair to which the invalid flag has been added is set to “0”. In the relevance statistics (mean) information table 34b, the invalid flag is also added to a data pair to which the invalid flag has been added in the middle of the accumulation period T, such as a data pair C-A.
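The summarizing step can be sketched as follows. The function name `build_mean_table` and the input layout are hypothetical, and treating a pair that is absent from a sub-period as an unflagged mean of 0 is an assumption made only to keep the example short.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_mean_table(
    sub_period_tables: List[Dict[Pair, Tuple[float, bool]]]
) -> Dict[Pair, Tuple[List[float], bool]]:
    """Collect per-Tm means into one row per pair, like table 34b.

    Each input table maps pair -> (mean value, invalid flag) for one Tm.
    A flagged mean is stored as 0, and a pair flagged in any sub-period
    stays flagged for the whole accumulation period T.
    """
    pairs = {pair for table in sub_period_tables for pair in table}
    result = {}
    for pair in pairs:
        means, flagged = [], False
        for table in sub_period_tables:
            # Assumption: a pair absent from a sub-period counts as mean 0.
            mean_value, invalid = table.get(pair, (0.0, False))
            means.append(0.0 if invalid else mean_value)
            flagged = flagged or invalid
        result[pair] = (means, flagged)
    return result
```

A pair that is valid in the first sub-period but flagged in the second one therefore ends up with a row such as `([3.0, 0.0], True)`.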
At the end of the accumulation period T, data arrangement is determined by using valid relevance information in the data rearrangement technology.
First, the relevance extracting unit 22 extracts a consecutively accessed data pair from an access sequence. As described above with reference to
When plural pieces of relevance information during the sub-period Tmi are recorded, the statistical processing unit 23 calculates a statistic (such as a mean value or a standard deviation) of the relevance information (the number of accesses) of each of the data pairs, as illustrated in
The statistical processing unit 23 regards, as invalid information, information of a data pair that satisfies either of a condition whereby the mean value of the number of accesses is less than or equal to a threshold, or a condition whereby the standard deviation is greater than or equal to a threshold, in the relevance statistics information table 34a, as illustrated in
The statistical processing unit 23 only leaves the mean value for each of the sub-periods Tmi in the relevance statistics information table 34a during the accumulation period so as to generate the relevance statistics (mean) information table 34b, as illustrated in
When plural pieces of information in the relevance statistics information table 34a for the accumulation period T are stored, namely, when the relevance statistics (mean) information table 34b is generated, the arrangement determining unit 24 performs a process that follows. Specifically, the arrangement determining unit 24 performs weighting on the mean value of the relevance ratio for each of the sub-periods of a data pair in which the invalid flag has not been set in such a way that a weight increases as a time period elapses, and calculates a final relevance ratio for each of the data pairs (S5). The process of S5 is described below with reference to
The arrangement determining unit 24 deletes the relevance statistics management information 34 after calculating the final relevance ratio (S6).
The controller 12 repeats the processes of S1 to S6 during each of the accumulation periods. In the relevance statistics information table 34a and the relevance statistics (mean) information table 34b, rows in which the invalid flag has been set may be appropriately deleted, or may be ignored when calculating optimum arrangement.
The arrangement determining unit 24 extracts data pairs in which the invalid flag has not been set from the relevance statistics (mean) information table 34b, and calculates a final relevance ratio for each of the data pairs by using the expression described below. A weight on a sub-period k (k=1 to N, and N: the number of sub-periods) is determined as described below. When an Exponentially Weighted Moving Average is used, the arrangement determining unit 24 exponentially reduces a weight in an order from an immediately previous sub-period to a past sub-period.
Assume that a relevance ratio of a data pair X-Y during the sub-period k is Pk. The arrangement determining unit 24 obtains a final relevance ratio REL of the data pair X-Y during the accumulation period T by using the following expression.
$$\mathrm{REL}_{X\text{-}Y} = \alpha \times \left( P_N + (1-\alpha)\,P_{N-1} + (1-\alpha)^2\,P_{N-2} + \cdots \right)$$
where α represents a smoothing coefficient (0 to 1) for obtaining a degree of a decrease in a weight, and is specified in advance.
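The weighted calculation can be sketched as follows, reading the expression as an exponentially weighted sum in which the weight of the sub-period j steps before the most recent one is α(1−α)^j. The function name `final_relevance` and the oldest-first ordering of the input are assumptions for illustration.

```python
from typing import Sequence


def final_relevance(means: Sequence[float], alpha: float) -> float:
    """Exponentially weighted sum of per-sub-period relevance ratios.

    `means` is ordered oldest first, so means[-1] is P_N. The weight of
    P_(N-j) is alpha * (1 - alpha) ** j.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    total = 0.0
    for j, p in enumerate(reversed(means)):
        total += (1.0 - alpha) ** j * p
    return alpha * total
```

With α = 0.5, hypothetical means of `[0, 0, 8]` (high only in the most recent sub-period) give 4.0, whereas `[8, 0, 0]` (high only in the oldest sub-period) gives 1.0, so recent relevance dominates the final value.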
As illustrated in
A left-hand portion of
The relevance information of the data pair G-I drastically changes, and therefore the relevance information is assumed to satisfy the condition (standard deviation ≧1). In this case, the statistical processing unit 23 determines that the relevance information of the data pair G-I is invalid.
The relevance information of the data pair H-J stationarily has a low relevance ratio, and therefore the relevance information is assumed to satisfy the condition (mean value ≦1). In this case, the statistical processing unit 23 determines that the relevance information of the data pair H-J is invalid. Accordingly, relevance information of the data pairs F-G, F-H, G-H, and I-J, which has not been determined to be invalid, is valid.
As in the description above of the process of S5 in
A graph structure of valid relevance information is represented as illustrated in an upper-right portion of
The input/output managing unit 21 reads (accesses) data specified in a request input from a request source from the memory 13 or the disk 14, and transmits the data to the request source (S11). When the data specified in the request does not exist in the memory 13, the input/output managing unit 21 reads all pieces of data included in a segment to which the data specified in the request belongs from the disk 14, by using the data/segment correspondence table 32. The input/output managing unit 21 then transmits, to the request source, the data specified in the request from among all read pieces of data included in the segment.
The relevance extracting unit 22 specifies a current sub-period Tm_k from among sub-periods within the accumulation period T (S12).
The relevance extracting unit 22 updates information of the sub-period Tm_k in the relevance management table (S13). Specifically, the relevance extracting unit 22 records relevance information of the extracted data pair for each of the sub-sub-periods Ts within the sub-period Tm_k, in the relevance management table 33 (namely, the relevance extracting unit 22 increments the number of accesses by 1), as in the description above of the process of S1 in
During the sub-period Tm, the relevance extracting unit 22 repeats the processes of S11 to S13 (“YES” in S14).
When the sub-period Tm has passed (“NO” in S14), the statistical processing unit 23 calculates relevance statistics information on the basis of the relevance information during the sub-period Tm_k (S15). Specifically, as in the description above of the process of S2 in
The statistical processing unit 23 adds an invalid flag to invalid information in the generated relevance statistics information (S16). Specifically, the statistical processing unit 23 regards information of a data pair that satisfies either of a condition whereby the mean value of the number of accesses is less than or equal to a threshold and a condition whereby the standard deviation is greater than or equal to a threshold in the relevance statistics information table 34a (
When the accumulation period T has not yet passed (“YES” in S17), the process returns to S11, and the processes of S11 to S16 are performed during the subsequent sub-period Tm_k+1.
When the accumulation period T has passed (“NO” in S17), the statistical processing unit 23 adds an invalid flag to invalid information in the relevance statistics information (S18). Specifically, as in the description above of the process of S4 in
The arrangement determining unit 24 calculates final relevance information (S19). Specifically, as in the descriptions above of the process of S5 in
Then, the arrangement determining unit 24 determines whether data arrangement needs to be changed, on the basis of the calculated final relevance ratio for each of the data pairs (S20). In this case, the arrangement determining unit 24 determines whether the correspondence of data and a segment needs to be changed, namely, whether segments need to be reorganized, on the basis of the calculated final relevance ratio for each of the data pairs. As described with reference to
When data arrangement does not need to be changed, namely, when it is determined that the correspondence of data and a segment does not need to be changed (“NO” in S20), the arrangement determining unit 24 terminates the process in this flowchart.
When data arrangement needs to be changed, namely, when it is determined that the correspondence of data and a segment needs to be changed (“YES” in S20), the arrangement determining unit 24 performs a process that follows. Specifically, the arrangement determining unit 24 changes the correspondence of data and a segment on the basis of a result of reconfiguration of segments performed in S20 (S21).
The arrangement determining unit 24 updates the data/segment correspondence table 32 on the basis of the changed correspondence relationship between data and a segment (S22).
Then, the arrangement determining unit 24 deletes the relevance management table 33 and the relevance statistics management information 34 (S23).
According to the embodiments, valid relevance information and invalid relevance information for rearrangement are distinguished from each other. As a result, when the invalid relevance information is appropriately deleted, an amount of data stored for optimization can be reduced. When the invalid relevance information is not used in the calculation of arrangement, targets for the calculation process can be reduced. Further, when the invalid relevance information is not used in the calculation of arrangement, arrangement that appears to be valid (namely, arrangement that temporarily has a high relevance ratio), but actually has a low effect (namely, arrangement whereby an effect is lost immediately after rearrangement) can be prevented.
According to an aspect of the invention, data arrangement can be attained that has a high efficiency in reading in accordance with a change in a tendency of a data access situation.
The invention is not limited to the embodiments described above, and various configurations and embodiments can be embodied without departing from the spirit of the invention.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2015-040783 | Mar 2015 | JP | national |