Embodiments of the present disclosure are directed to methods for improving the efficiency of write operations in phase change random access memory (PRAM).
Phase-change RAM (PRAM) is currently a leading next-generation memory approach. However, although single-cell write and read power are low and competitive, packing multiple cells in dense cross-point array architectures causes memory operations to consume high power due to effects on unselected cells. The resulting energy magnitude offsets the cell-level advantages. Therefore, array-scale process design and data management algorithms are needed to cope with redundant power issues.
To switch a cell from high to low-resistance state, a high voltage drop is required. The corresponding wordline and bitline are assigned with the required potentials, but the unselected cells that are placed on the same row or column also suffer from voltage variations. Attempts to reduce the resulting unwanted potentials by using additional voltage on unselected bitlines pass the problem to unselected cells on those columns and to assignment of voltage in other wordlines. Eventually, a set of four voltages, for the selected and unselected rows and columns, is used to mitigate unwanted power and write disturb. The equivalent process is performed when transforming cells from low to high-resistance state, with reverse voltage drops, possibly in other values.
Write operations are the main source of power consumption in emerging phase change memory (PCM) technologies, such as phase change RAM (PRAM). In a cross-point vertical array architecture, a set of voltages or high impedances is applied to all wordlines and bitlines such that the intended write voltage is the potential difference over target cells, causing their resistance to be switched. However, unselected cells also have unintended voltage drops. Since the resistance gap between low and high cell states is at least about an order of magnitude, almost all unwanted power, and consequently most of the total write power, is consumed by low-resistance cells located in unselected wordlines.
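By way of illustration only, the dominance of the low-resistance cells can be estimated as follows. The sketch assumes, for simplicity, that an equal number of low-resistance and high-resistance cells see the same unintended voltage drop $V_u$ (a symbol introduced here for illustration) and that each cell dissipates $V_u^2/R$. With $R_H \ge 10\,R_L$, the share of the unwanted power dissipated in the low-resistance cells is then:

$$\frac{V_u^2/R_L}{V_u^2/R_L + V_u^2/R_H} = \frac{R_H}{R_H + R_L} \ge \frac{10}{11} \approx 0.91.$$

Under these assumptions, roughly nine tenths or more of the unwanted power falls on the low-resistance cells, and the share grows as the resistance gap widens.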
Exemplary embodiments of the present disclosure are directed to systems and methods for improving PRAM write algorithms by combining a write algorithm with selective valid pages read and copy-back. Analysis shows that up to a 40% power reduction can be obtained as compared to prior art algorithms. Embodiments of the disclosure can be implemented by modifying the software of an SSD controller. A method according to an embodiment of the disclosure is scalable as PRAM blocks occupy more wordlines.
According to an embodiment of the disclosure, there is provided a method for performing a write operation in a random access memory (RAM), including selecting a target block in a RAM with a greatest number of invalid pages, reading valid pages from the target block, when a number of invalid pages is greater than a predetermined threshold, performing a bitline-wise block erase of the target block in said RAM, and copying-back valid data to the erased target block in a row-by-row set operation, wherein the erased target block is written with the valid data.
According to a further embodiment of the disclosure, the method includes receiving an incoming write request, and writing data of the incoming write request to the target block in the row-by-row set operation, wherein the incoming data is written to the erased block along with the valid data.
According to a further embodiment of the disclosure, the method includes placing data of the incoming write request in a write buffer of a controller of said RAM.
According to a further embodiment of the disclosure, performing the bitline-wise block erase comprises sequentially powering on each bitline with a predetermined reset voltage wherein all other bitlines and wordlines are grounded.
According to a further embodiment of the disclosure, the method includes, when the number of invalid pages is less than or equal to the predetermined threshold, performing a sub-block write operation of data received with a write request.
According to a further embodiment of the disclosure, the RAM is one selected from a group comprising a phase change random-access memory, a resistive random-access memory, a ferroelectric random-access memory, and a magnetoresistive random-access memory.
According to a further embodiment of the disclosure, the bitline-wise block erase is one of a partial block erase or a full block erase.
According to a further embodiment of the disclosure, the steps of selecting a target block, reading valid pages from the target block, performing a bitline-wise block erase of the target block, and copying-back valid data to the erased target block are performed as one of an automatic refresh operation or as an operation initiated by a user.
According to a further embodiment of the disclosure, the method includes mapping a logical address of a bitline of the target block to a virtual address according to $f(\vec{x}) = \vec{x} \cdot P \cdot \vec{b} + r + c \cdot q \pmod{2^n + z}$, wherein $\vec{x}$ is an n-bit logical address, $f(\vec{x})$ is the virtual address, n is a log (base 2) of the logical address space size, P is an invertible n×n permutation matrix, $\vec{b} = (2^{n-1}, 2^{n-2}, \ldots, 1)^T$ is a vector that converts the binary n-bit address to a natural number, z is a current number of spare lines, r is a round number, wherein a round is an act of remapping all lines in the target block, and s is a step number, wherein a step is an act of re-mapping one bitline.
According to a further embodiment of the disclosure, the method includes mapping the virtual address of the bitline of the target block to a physical address by mapping the virtual address v to v+i, where i is the index of a maximum value $s_i$ such that $s_i < v$, wherein $s_i \in S = (s_1, \ldots, s_l)$, wherein a value of $s_i$ is a largest virtual address that maps to a physical address below $b_i$, wherein $b_i \in (b_1, \ldots, b_l)$, a sorted list of physical addresses of bad lines, wherein l is a current number of bad lines.
According to another embodiment of the disclosure, there is provided a non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executed by the computer to perform the method steps for a write operation in a random access memory (RAM).
According to another embodiment of the disclosure, there is provided a method of performing a write operation in a random access memory (RAM), including selecting a target block in a RAM with a greatest number of invalid pages, and performing a sub-block write operation of data received with a write request, when the number of invalid pages is less than or equal to a predetermined threshold.
Exemplary embodiments of the disclosure as described herein generally provide systems and methods for improving PRAM write algorithms by combining a write algorithm with selective valid pages read and copy-back. While embodiments are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
Past research has indicated that write power can be reduced by writing pages in block resolution rather than wordline resolution. This scheme, referred to as sub-block write aggregation, buffers a group of write requests, selects the block with the most invalid wordlines, and then performs all phase RESETs, which modify cells to high-resistance, in a row-by-row manner, followed by a SET to group cells separately. In this algorithm, the unintended voltage drops are located mostly on high-resistance cells. Analysis of the results showed an energy reduction of up to 85% as compared to random wordline writes. The scheme was expanded to include multi-level cell (MLC) writes and to manage opportunistic power-saving writes.
Embodiments of the disclosure are directed to methods of improving PRAM write algorithms to save additional energy, by combining a write algorithm with selective valid pages read and copy-back. It has been observed that performing a bitline-wise RESET to a whole block, by setting all bitlines with the RESET voltage while all wordlines are grounded at zero volts, is highly power-efficient. Therefore, embodiments of the disclosure are directed to a case of copying valid pages, executing a bitline-wise block erase, and writing back the valid data along with new incoming write requests. When the number of invalid wordlines is low, the previous sub-block write scheme is expected to be more power-efficient, whereas when most of the wordlines are invalid, it is preferable to read valid pages, perform bitline-wise erases, and copy back the data along with the new data write. A consolidation scheme according to an embodiment calculates the number of invalid pages above which such a process has an advantage over sub-block write aggregation. Power dissipation can be modeled in both processes to find the number of invalid pages at which copy-back outperforms sub-block, and this selection can be integrated with an online write algorithm.
Notation 1 (Cross-Point Array Size): The number of (rows)×(columns) in a memory cell matrix is denoted by M×N, where M and N are the numbers of cross-point array wordlines (rows) and bitlines (columns), respectively. In addition, WL/BL are the abbreviations for wordline/bitline, respectively.
Notation 2 (Data Distribution and Cell's States): Data according to embodiments is assumed to be binary and distributed with Bernoulli-(½) probability. The device is considered as binary (single-level cell, SLC), with the low-resistance state (LRS) marked with RL and the high-resistance state (HRS) with RH. An MLC generalization is given below.
Notation and Definition 3 (Random Wordline Write): A write process according to an embodiment includes two steps: SET and RESET (or vice versa). In a RESET phase, cells are switched to HRS (RESET) by applying a reset voltage VRST on the wordline and grounding (0 volt) target cell bitlines. The unselected rows/columns are powered with VRWL/VRBL. If a prior read is carried out and only the LRS cells among the involved cells are selected, power consumption is reduced. In a SET phase, cells are switched to LRS (SET). The voltage VSET is used for the target wordline, the intended cells' bitlines are grounded, and VSWL/VSBL are used for unselected wordlines/bitlines.
Notation and Definition 4 (Sub-Block Write): An algorithm according to an embodiment buffers several write requests, selects the block with the most invalid pages, performs RESET to all invalid pages according to new data, using a prior read to avoid power waste, followed by a proper row-by-row SET.
Notation and Definition 5 (Bitline-Wise Block Erase and SET): A first erase phase according to an embodiment sets VRST in each bitline in turn, while the other bitlines are grounded, and grounds all wordlines. Next, the block is written with wordline SET operations, i.e., VSET on the target wordline, the selected cells' bitlines are grounded, and unselected rows/columns are powered with VSWL/VSBL.
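By way of illustration only, the erase phase of Definition 5 can be sketched as a sequence of voltage assignments. The function names and the example reset voltage below are assumptions made for this sketch and do not come from the disclosure; the model only computes the voltage difference over each cell.

```python
from typing import Iterator, List, Tuple


def bitline_wise_erase_steps(
    num_wordlines: int, num_bitlines: int, v_rst: float
) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (wordline_voltages, bitline_voltages) for each erase step.

    One bitline at a time is driven with v_rst; every other bitline and
    every wordline is grounded, as in the erase phase described above.
    """
    for k in range(num_bitlines):
        wordlines = [0.0] * num_wordlines
        bitlines = [0.0] * num_bitlines
        bitlines[k] = v_rst
        yield wordlines, bitlines


def cell_voltage_drops(
    wordlines: List[float], bitlines: List[float]
) -> List[List[float]]:
    """Magnitude of the voltage across each cell, indexed [row][column]."""
    return [[abs(b - w) for b in bitlines] for w in wordlines]


if __name__ == "__main__":
    # Hypothetical 3x4 array and a 2.0 V reset voltage, for illustration.
    for wl, bl in bitline_wise_erase_steps(3, 4, 2.0):
        print(cell_voltage_drops(wl, bl))
```

In this simplified model, only the cells on the powered bitline see the reset voltage in a given step, and all other cells see no voltage difference.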
Notation and Definition 6 (RESET/SET and Read Power): According to embodiments, the power to switch a cell from a low to a high, or from a high to a low, resistance state is denoted PRST or PSET, respectively. The power consumed during a wordline read is denoted PRD.
Notation and Definition 7 (Write Power of a Single Memory Cell): According to embodiments, the power needed to change a single cell's resistance state from low to high (reset) is denoted and defined as:
$$P_{RST} = \int_{R} \ldots$$
where I(R) is the current through the PRAM device at resistance R. The alternative cell's high to low write (set) power is:
$$P_{SET} = \int_{R} \ldots$$
The latencies of set and reset are assumed to be relatively the same: tSET=tRST, and therefore the energy is proportional to power. In case that the set time differs from the reset time, the analysis still holds and derivation of energy data would require multiplying the power with the corresponding time duration.
Notation and Definition 8 (Read Operation and Power): According to embodiments, a read is performed by powering the target wordline with VRD while all bitlines are grounded (0 V). The current at each bitline is measured to estimate the cell's resistance. The power of the read process (with random data distribution) is:
The read power includes a parasitic component that originates from sneak current. This fraction may be considered to be negligible when compared to the overall power and may be omitted in a gain analysis.
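For orientation only, a first-order estimate of the read power can be written from the description above. The estimate is a sketch that assumes the parasitic sneak-current component is omitted, that each cell on the target wordline dissipates $V_{RD}^2/R$, and that the $N$ cells on the target wordline are split evenly between LRS and HRS; it is not necessarily the exact expression used in the analysis:

$$P_{RD} \approx \frac{N}{2}\,\frac{V_{RD}^2}{R_L} + \frac{N}{2}\,\frac{V_{RD}^2}{R_H} = \frac{N\,V_{RD}^2}{2}\left(\frac{1}{R_L} + \frac{1}{R_H}\right).$$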
According to an embodiment, an analytical model of power consumption is constructed for a consolidated copy-back and write method.
1) Random Wordline RESET&SET with Prior Read
In a planned read operation according to an embodiment, the location of HRS cells is known. Out of N/2 planned HRS, half are already in that state and reset is executed on N/4 cells. The unselected cells on target wordline are N/2 HRS and N/4 in LRS and are affected with VRBL. The remaining wordlines are N/4 cells affected by VRST−VRWL and 3N/4 cells affected by VRBL−VRWL with equally distributed LRS and HRS cells. The power of the reset phase with known wordlines is:
Similarly, the LRS cells are known prior to the set phase and N/4 out of planned N/2 are already in LRS. The remaining wordline cells are at HRS due to previous reset phase.
The overall power in reset-before-set with prior read is:
Note that RESET-before-SET and SET-before-RESET were found to have about the same power consumption for a random wordline.
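The N/4 cell counts used in this analysis follow from the Bernoulli-(½) data assumption, and can be checked by enumerating the four equally likely combinations of a cell's current state and its planned state. The following is a minimal sketch; the state labels and function name are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

LRS, HRS = 0, 1  # state labels chosen for this sketch only


def transition_fractions() -> dict:
    """Expected fraction of cells that need RESET, need SET, or stay as-is.

    Old and new states are independent and uniform, so each of the four
    (old, new) pairs has probability 1/4.
    """
    counts = {"reset": 0, "set": 0, "unchanged": 0}
    for old, new in product((LRS, HRS), repeat=2):
        if old == LRS and new == HRS:
            counts["reset"] += 1      # low -> high resistance
        elif old == HRS and new == LRS:
            counts["set"] += 1        # high -> low resistance
        else:
            counts["unchanged"] += 1  # already in the planned state
    total = sum(counts.values())
    return {name: Fraction(c, total) for name, c in counts.items()}


if __name__ == "__main__":
    print(transition_fractions())
```

For a wordline of N cells, this gives N/4 cells to RESET and N/4 cells to SET when a prior read is available, in agreement with the counts stated above.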
2) Sub-Block RESET&SET Aggregation with Prior Read
Since it is not common that a whole block contains invalid pages, a write operation according to an embodiment can be performed in sub-block resolution. The selected block has i invalid wordlines, which are first all RESET sequentially (each row separately) and then SET (wordline-wise). In the RESET phase, the power of unselected cells is gradually decreasing. In the first RESET operation, all other rows (M−1 rows) have expectancy of half of the cells at LRS. Adding a prior read to a sub-block write flow reduces power only at the reset phase, since at the SET phase all cells at the target wordline are HRS, and the read does not reveal information. The read targets only the cells that need RESET, with their specific bitline and wordline configuration, instead of performing RESET on all row cells. The first row (W1) RESET is equivalent to a single wordline RESET-before-SET with prior read:
$$P_{W_1} = P_{RD\text{-}RBS\text{-}RST}$$
At the second row RESET, there is a single row with ¾ HRS cells and (M−2) rows with random data:
The power consumption when doing RESET to the third wordline is:
At the last ith row RESET out of M rows, (i−1) rows are with ¾ cells HRS. The consumed power is:
The sum of RESET power of i rows out of M, given prior read data is:
In the SET phase, a data read is not performed. The first wordline is written when all cells are in HRS in i rows and the other (M−i) rows hold random data. In each wordline SET, the number of HRS rows decreases by one and the number of random data rows increases by one. Therefore, it is equivalent to a sub-block SET without prior read:
$$P_{RD\text{-}SUB\text{-}SET} = P_{SUB\text{-}SET}.$$
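The gradual decrease of unselected-cell power during the RESET phase described above can be illustrated by counting the expected number of LRS cells in the unselected rows. The sketch below relies only on the statements that an already-RESET row has ¾ of its cells in HRS and that the remaining rows hold random data; the function name is hypothetical, and the count is a proxy for, not a substitute for, the power expressions.

```python
from fractions import Fraction


def expected_lrs_in_unselected_rows(k: int, m_rows: int, n_cols: int) -> Fraction:
    """Expected LRS cells in the M-1 unselected rows at the k-th row RESET.

    k is 1-based. The (k-1) rows already RESET have 1/4 of their cells in
    LRS; the other (M-k) unselected rows hold random data (1/2 in LRS).
    """
    if not 1 <= k <= m_rows:
        raise ValueError("k must satisfy 1 <= k <= m_rows")
    already_reset_rows = k - 1
    random_rows = m_rows - k
    return already_reset_rows * Fraction(n_cols, 4) + random_rows * Fraction(n_cols, 2)


if __name__ == "__main__":
    # Hypothetical 8x16 block, for illustration.
    for k in range(1, 5):
        print(k, expected_lrs_in_unselected_rows(k, 8, 16))
```

Each additional RESET row lowers the expected count by N/4, which is consistent with the decreasing unselected-cell power noted for the RESET phase.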
The overall write power of sub-block with prior read in RESET phase is:
$$P_{RD\text{-}SUB\text{-}R\&S} = i \cdot P_{RD} + P_{RD\text{-}SUB\text{-}RST} + P_{RD\text{-}SUB\text{-}SET}.$$
The result, normalized per single wordline, is:
Given a block with i out of M invalid wordlines, a first step according to an embodiment is to read out the valid pages:
$$P_{Prior\text{-}RD} = (M - i)\,P_{RD}$$
Next, an energy-efficient bitline-wise erase is performed column-by-column to all cells in the block:
According to an embodiment, after an erase, the read valid pages are grouped with additional new data for a total of M new write requests. The aggregated write is performed row-by-row. The power per wordline write grows since as more pages are written, more cells at LRS have unwanted voltage drops.
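The growth of the per-wordline power during the aggregated write can be illustrated by counting the expected number of LRS cells in rows that have already been written. The sketch assumes that every cell is in HRS after the erase and that each written row holds random (Bernoulli-½) data; the function names are hypothetical, and the count is only a proxy for the power expressions.

```python
from fractions import Fraction


def expected_lrs_in_written_rows(k: int, n_cols: int) -> Fraction:
    """Expected LRS cells in unselected rows when the k-th wordline is SET.

    k is 1-based. The (k-1) rows written so far hold random data, so half
    of their cells are in LRS; unwritten rows are still fully in HRS.
    """
    if k < 1:
        raise ValueError("k is 1-based")
    return (k - 1) * Fraction(n_cols, 2)


def cumulative_lrs_in_written_rows(m_rows: int, n_cols: int) -> Fraction:
    """Sum of the per-step expected counts over all M row-by-row SETs."""
    return sum(
        (expected_lrs_in_written_rows(k, n_cols) for k in range(1, m_rows + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    # Hypothetical 4x16 block, for illustration.
    print([expected_lrs_in_written_rows(k, 16) for k in range(1, 5)])
    print(cumulative_lrs_in_written_rows(4, 16))
```

The count is zero for the first wordline and grows by N/2 with each written row.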
The SET power of the first wordline W1 is:
In the SET of the second wordline W2, the consumed power is:
The power at the third wordline W3 is:
At the last wordline (there are a total of M wordlines per block):
The power expression for a SET of the wordline with index i out of M, when the previous (i−1) wordlines have already been written, is:
The accumulated power of the row-by-row SET after the block has been erased is:
The overall power of consolidated copy-back and write (CBWR), given a block with i invalid pages is:
$$P_{CBWR} = P_{Prior\text{-}RD} + P_{Block\text{-}ERS} + P_{ERS\text{-}SET}$$
The normalized power per wordline when writing i wordlines is:
A consolidation of copy-back and write operations according to an embodiment is depicted in
According to an embodiment, to determine the number of invalid pages, denoted by i, in which consolidation is more power-efficient than a sub-block write, the expressions are compared and the resulting equation is solved:
$$P_{CBWR\text{-}WL}(i) = P_{RD\text{-}SUB\text{-}R\&S}(i)$$
Since these equations are quadratic, there are two possible solutions, and the smaller one is taken into account since the other is out of range. The solution is illustrated in Table 3,
Algorithm 1: Consolidation of Copy-Back and Write
Input: threshold t as calculated in Table 2
(1) Choose the block with the most invalid pages
(2) If the number of invalid pages > t
(2.1) Read valid pages
(2.2) Perform bitline-wise block erase
(2.3) Write back valid pages and incoming write requests
(3) Else
(3.1) Perform sub-block write algorithm
(4) End If
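The control flow of Algorithm 1 can be sketched in software as follows. This is a minimal illustration rather than a controller implementation: the Block class, the use of None to mark an invalid page, and the function name are assumptions made for the sketch, the erase and write steps are modelled as list updates, and the threshold is supplied by the caller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical block model: one entry per wordline, None = invalid page."""
    pages: List[Optional[bytes]]

    def invalid_count(self) -> int:
        return sum(1 for p in self.pages if p is None)


def consolidate_write(blocks: List[Block], incoming: List[bytes], threshold: int) -> str:
    """Apply one pass of the consolidation flow; return the path taken."""
    # (1) choose the block with the most invalid pages
    target = max(blocks, key=Block.invalid_count)
    if target.invalid_count() > threshold:
        # (2.1) read valid pages
        valid = [p for p in target.pages if p is not None]
        # (2.2) bitline-wise block erase, modelled as clearing every row
        size = len(target.pages)
        target.pages = [None] * size
        # (2.3) write back valid pages and incoming requests, row by row
        for row, page in enumerate((valid + incoming)[:size]):
            target.pages[row] = page
        return "copy-back"
    # (3.1) sub-block write: only invalid rows receive incoming pages
    pending = iter(incoming)
    for row, page in enumerate(target.pages):
        if page is None:
            new_page = next(pending, None)
            if new_page is None:
                break  # no more buffered requests
            target.pages[row] = new_page
    return "sub-block"


if __name__ == "__main__":
    blk = Block([b"a", None, None, None])
    print(consolidate_write([blk], [b"n1", b"n2"], threshold=2), blk.pages)
```

In this sketch, both paths write at most as many incoming pages as the block has invalid rows; requests that do not fit would remain in the write buffer.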
According to an embodiment, the exact powers of the expressions in Table 1 are calculated and normalized according to a random wordline write with prior read.
According to an embodiment, up to a 40% power reduction is observed as compared to only a sub-block write algorithm, and up to 60% as compared to a current random wordline write. The gain increases exponentially above a threshold. The exact gain is determined according to workload characteristics and the distribution of invalid pages in blocks.
According to an embodiment, implementation of a consolidated copy-back with write involves modifying the controller as follows.
The controller should be able to accumulate incoming write request data and valid page data of the target block. However, this buffer is also used by sub-block write algorithms, and therefore only small changes are needed beyond the overhead of the previous algorithm.
According to an embodiment, the consolidation process should be managed by the controller and includes meta-data management of invalid pages in a block, which already exists in current write mechanisms, a scheduler for reading valid pages, write buffer management and allocating copy-back and write incoming pages to the target block. Additional overhead over a sub-block write is the threshold consideration.
According to an embodiment, PRAM array voltages for bitline-wise erases are a sub-group of regular RESET processes, and can be implemented in a conventional memory chip.
Expansion of a consolidation process according to an embodiment to multi-level cells does not require additional modifications. Furthermore, a consolidation process according to an embodiment can be combined with other write algorithms to increase data reliability, since the reduction of power drops over unselected cells also reduces write disturb. A consolidation process according to an embodiment is scalable, as it achieves more gain, i.e., lower power, as the block size grows. The overhead can be shown to be practical, and can be reduced further if an algorithm according to an embodiment is implemented on top of a previous sub-block write scheme.
During a wear leveling process according to an embodiment, line data is copied and re-written to another line. The re-write process to the new line can detect a bad row and activate a bad-row management within a wear leveling process.
A wear-leveling (WL) algorithm according to an embodiment of the disclosure works by constantly changing the logical-to virtual (L2V) address mapping, and re-mapping the memory lines according to the new mapping. The act of changing the entire mapping one time can be referred to as a “round”, and the act of re-mapping one memory-line according to the new mapping can be referred to as a “step”. When all memory-lines are re-mapped according to the new mapping, i.e., one per step, one round is complete, the mapping changes, and a new round begins.
A mapping function according to an embodiment includes two components: a first, fixed component, and a second, changing component that is composed on the first. As illustrated in
According to an embodiment, a bad row management (BRM) algorithm acts on top of a WL-algorithm. A BRM algorithm “filters out” the bad blocks, by mapping the virtual addresses to physical addresses that are not bad.
To enable a WL-algorithm according to an embodiment, a spare physical line is used, so there is always a free line into which to move the currently-updated line; when this currently-updated line is transferred, its previous location is the target for the next re-mapped line, and so on. However, according to an embodiment, a WL-algorithm and a BRM-algorithm work in combination: since the BRM requires over-provisioning, at any given time there is a sequence of spare lines; in each step the WL algorithm uses the last line from this sequence as the target for the newly transferred line; its previous location enters the sequence; and so on. Thus, the spare sequence moves along the physical space, and within each round the lines are updated in a descending order, as illustrated in
During the life of a PRAM, the size of the spare sequence is decreased, as new lines are declared “bad”. The failure mechanism is that due to temperature changes, the physical contact between chip circuitry and line contact becomes disconnected, so that the line itself becomes inoperative.
According to an embodiment, a step is performed according to a system timing policy.
Some possible policies are:
According to an embodiment, an L2V mapping can be defined as follows:
At round r, executing step s comprises mapping address $\vec{x} \in \{0,1\}^n$:
$$f(\vec{x}) = \vec{x} \cdot P \cdot \vec{b} + r + c \cdot z \pmod{2^n + z},$$
where:
n is the log (base 2) of the logical address space size;
P is the permutation (invertible) n×n matrix;
$\vec{b} = (2^{n-1}, 2^{n-2}, \ldots, 1)^T$, used here to convert the binary vector to the natural number it represents;
z is the current number of spare lines; and
c is a binary indicator; that is, c equals 0 if physical address $\vec{x} \cdot P \cdot \vec{b}$ is not updated in the current round, and 1 otherwise.
Note that the multiplication by $\vec{b}$ is written here for formal reasons; it does not correspond to any operation in hardware. In hardware, the result of $\vec{x} \cdot P$ is simply added to r+c·z.
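A software sketch of the L2V mapping may help make the terms concrete. It follows the note above that the result of the permutation is added to r+c·z, with the sum taken modulo 2^n+z. The function name is hypothetical, and because the rule that determines c within a round is not spelled out here, c is passed in by the caller as 0 or 1.

```python
from typing import Sequence


def l2v_map(
    x: Sequence[int], perm: Sequence[Sequence[int]], r: int, c: int, z: int
) -> int:
    """Map an n-bit logical address (MSB first) to a virtual address.

    perm is an n x n permutation matrix, r the round number, z the current
    number of spare lines, and c the 0/1 indicator supplied by the caller.
    """
    n = len(x)
    # Row vector times matrix: permuted[j] = sum_i x[i] * perm[i][j]
    permuted = [sum(x[i] * perm[i][j] for i in range(n)) for j in range(n)]
    # Multiplying by b = (2^(n-1), ..., 1)^T reads the bits as a number
    value = sum(bit << (n - 1 - j) for j, bit in enumerate(permuted))
    return (value + r + c * z) % (2 ** n + z)


if __name__ == "__main__":
    # Hypothetical 3-bit example: the permutation swaps the first and last bits.
    swap = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    print(l2v_map([1, 0, 0], swap, r=2, c=1, z=3))
```

In hardware, the conversion through the vector b is implicit, so only the bit permutation and the modular addition would be realized.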
According to an embodiment, the output of the L2V module is provided as input to a BRM algorithm, which filters out lines that are declared as bad by mapping the virtual addresses to the set of lines that are not declared bad. This means, in particular, that the size of the virtual space is exactly the number of non-bad lines.
Lines can be declared as bad either at manufacture time or during the life of the PRAM; embodiments of the disclosure can handle both cases similarly.
A virtual-to-physical (V2P) address mapping is straightforward: virtual address v is mapped to the v-th non-bad line. The implementation is as follows: Let the sorted list of physical addresses of bad lines be $(b_1, \ldots, b_l)$, where l is the current number of bad lines.
A BRM algorithm according to an embodiment maintains a list $S = (s_1, \ldots, s_l)$ derived from the set of bad lines as follows: the value of $s_i$ is the largest virtual address mapped to a physical address below $b_i$. Formally: $s_i = b_i - i$. In addition, define $s_0 = -1$ for convenience.
Then, the virtual address v is mapped to v+i, where i is the maximum index such that $s_i < v$. The value i can be found through a binary search.
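The V2P lookup can be sketched with a standard binary search. The sketch assumes zero-based virtual and physical addresses and stores only s_1 through s_l, leaving s_0 = −1 implicit; the function names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def build_s(bad_lines: List[int]) -> List[int]:
    """Return [s_1, ..., s_l] with s_i = b_i - i for the sorted bad lines."""
    return [b - i for i, b in enumerate(sorted(bad_lines), start=1)]


def v2p(v: int, s: List[int]) -> int:
    """Map virtual address v to the v-th non-bad physical line.

    S is non-decreasing, so the number of entries strictly below v equals
    the maximum index i with s_i < v; bisect_left returns that count.
    """
    i = bisect_left(s, v)
    return v + i


if __name__ == "__main__":
    # Hypothetical example: physical lines 2 and 5 are bad.
    s = build_s([2, 5])
    print(s, [v2p(v, s) for v in range(5)])
```

With bad lines 2 and 5, the virtual addresses 0 through 4 land on the physical lines 0, 1, 3, 4 and 6, skipping both bad lines.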
The table of
According to an embodiment, the list S can be maintained as follows.
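1. $s_0 \leftarrow -1$
2. For i=1 to l: $s_i \leftarrow b_i - i$
Input: new bad physical line b
1. Find via binary search the maximal j such that $s_j + j < b$
2. For i=l to j+2 (with l already counting the new bad line): $s_i \leftarrow s_{i-1} - 1$
3. $s_{j+1} \leftarrow b - (j+1)$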
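The update of the list S on a newly declared bad line can be sketched as follows. The shift rule used in the loop is derived here from the definition s_i = b_i − i (each bad line above the new one moves up one index, so its entry decreases by one); it is an inference for this sketch rather than a quotation. The sketch stores s_1 through s_l in a zero-based Python list and assumes that b is not already listed as bad.

```python
from typing import List


def add_bad_line(s: List[int], b: int) -> None:
    """Update S = [s_1, ..., s_l] in place when physical line b turns bad."""
    # Binary search for the maximal j with s_j + j < b (that is, b_j < b).
    lo, hi = 0, len(s)
    while lo < hi:
        mid = (lo + hi) // 2            # list index mid holds s_(mid+1)
        if s[mid] + (mid + 1) < b:
            lo = mid + 1
        else:
            hi = mid
    j = lo                              # number of bad lines below b

    s.append(0)                         # l grows by one; placeholder entry
    # For i = l down to j+2 (1-based): s_i <- s_(i-1) - 1
    for idx in range(len(s) - 1, j, -1):
        s[idx] = s[idx - 1] - 1
    # s_(j+1) <- b - (j+1)
    s[j] = b - (j + 1)


if __name__ == "__main__":
    s: List[int] = []
    for bad in (5, 2, 3):               # hypothetical failure order
        add_bad_line(s, bad)
        print(bad, s)
```

After the three insertions the list equals the one obtained by recomputing s_i = b_i − i from the sorted bad lines (2, 3, 5).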
According to an embodiment, since a BRM algorithm is based on binary search, the number of supported bad blocks is limited by the number of steps that the binary search can support. To support a larger number of bad lines, an alternative approach according to an embodiment would be to declare as “bad” units that are larger than lines. If these units have $2^q$ lines, then q steps are saved in the binary search. These units are managed similarly as before, but over a space of size divided by $2^q$, and declaring a unit as bad when the first line in the unit turns bad. Denoting the previous mapping by f, applied here to unit indices, the mapping is then performed as follows:
Input: virtual address v
Output: physical address p
Parameter: BRM granularity q
1. $u \leftarrow \lfloor v/2^q \rfloor$
2. $r \leftarrow v - u \cdot 2^q$
3. $p \leftarrow f(u) \cdot 2^q + r$
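The coarse-granularity mapping can be sketched as follows. The sketch assumes that the previous mapping f operates on unit indices and returns a unit index, so that its result is scaled back by 2^q before the offset within the unit is added; the function and parameter names are hypothetical.

```python
from typing import Callable


def coarse_v2p(v: int, q: int, unit_map: Callable[[int], int]) -> int:
    """Map virtual line v to a physical line using units of 2**q lines.

    unit_map plays the role of the previous mapping f, applied to unit
    indices instead of individual lines (an assumption of this sketch).
    """
    unit = v >> q                  # unit index: floor(v / 2^q)
    offset = v - (unit << q)       # position of the line inside its unit
    return (unit_map(unit) << q) + offset


if __name__ == "__main__":
    # Hypothetical example: physical unit 1 is bad, so virtual units from 1
    # upward are shifted by one physical unit.
    def skip_unit_one(u: int) -> int:
        return u + 1 if u >= 1 else u

    print([coarse_v2p(v, 2, skip_unit_one) for v in (3, 4, 6)])
```

With q = 2, each unit covers four lines, and a single bad line removes its whole unit from the physical space in exchange for a shorter binary search.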
It is to be understood that embodiments of the present disclosure can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In some embodiments, the present disclosure can be implemented in hardware as an application-specific integrated circuit (ASIC), or as a field programmable gate array (FPGA). In other embodiments, the present disclosure can be implemented in software as an application program tangibly embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.
The computer system 91 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the systems components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
While the present invention has been described in detail with reference to exemplary embodiments, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims.