The present application claims the benefit of the Singapore patent application No. 10201400292T filed on 28 Feb. 2014, the entire contents of which are incorporated herein by reference for all purposes. The present application furthermore claims the benefit of the Singapore patent application No. 10201400303Y filed on 28 Feb. 2014, the entire contents of which are incorporated herein by reference for all purposes.
Embodiments relate generally to testing apparatuses, hierarchical priority encoders, methods for controlling a testing apparatus, and methods for controlling a hierarchical priority encoder.
Finding the most similar matches to a query vector from a large database of vectors, also known as Nearest Neighbor (NN) search, is a well-known problem in audio, video and other information retrieval, particularly audio/video fingerprinting, which tries to identify a query audio/video clip from a database of reference audio/video content. Exact NN search is challenging when the vectors have high dimensions, where no indexing structure is known to be consistently faster than brute-force search. For approximate NN (ANN), commonly used methods such as Locality Sensitive Hashing (LSH) either become slow due to an excessive number of hard disk seeks, or have to use an excessive amount of main memory for indexing, when the NN distance to the query vector is far and the database is large. Thus, there may be a need for more efficient methods and devices.
According to various embodiments, a testing apparatus may be provided. The testing apparatus may include: a cell pair comprising two l-bit (or more generally k-state) memory cells configured to represent a stored pattern of l-bit (or more generally k-state); and a converter configured to convert a query pattern of l-bit (or more generally k-state) into at least a pair of voltages defined such that when applied to gates of the cell pair, the voltages make the cell pair into either a high resistance mode or a low resistance mode, depending on whether the query pattern matches the stored pattern. In one embodiment, the voltages make the cell pair into high resistance mode when the query pattern matches the stored pattern and into low resistance mode when the query pattern does not match the stored pattern. In another embodiment, where the cell is made of a transistor serially connected to a programmable resistive element (i.e. NGMEM such as RRAM, PCRAM, or MRAM), the voltages make the cell pair into low resistance mode when the query pattern matches the stored pattern and into high resistance mode when the query pattern does not match the stored pattern.
According to various embodiments, a hierarchical priority encoder may be provided. The hierarchical priority encoder may include a multi-match controller configured to report multiple matches in case of multiple matches.
According to various embodiments, a method for controlling a testing apparatus may be provided. The method may include: controlling a cell pair of the testing apparatus, the cell pair comprising two l-bit (or more generally k-state) memory cells configured to represent a stored pattern of l-bit (or more generally k-state); and converting a query pattern of l-bit (or more generally k-state) into a pair of voltages defined such that when applied to gates of the cell pair, the voltages make the cell pair into either a high resistance mode or a low resistance mode, depending on whether the query pattern matches the stored pattern. In one embodiment, the voltages make the cell pair into high resistance mode when the query pattern matches the stored pattern and into low resistance mode when the query pattern does not match the stored pattern. In another embodiment, where the cell is made of a transistor serially connected to a programmable resistive element, the voltages make the cell pair into low resistance mode when the query pattern matches the stored pattern and into high resistance mode when the query pattern does not match the stored pattern.
According to various embodiments, a method for controlling a hierarchical priority encoder may be provided. The method may include controlling a multi-match controller of the hierarchical priority encoder to report multiple matches in case of multiple matches.
In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments are described with reference to the following drawings, in which:
Embodiments described below in context of the devices are analogously valid for the respective methods, and vice versa. Furthermore, it will be understood that the embodiments described below may be combined, for example, a part of one embodiment may be combined with a part of another embodiment.
In this context, the testing apparatus as described in this description may include a memory which is for example used in the processing carried out in the testing apparatus. In this context, the server as described in this description may include a memory which is for example used in the processing carried out in the server. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
In an embodiment, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with an alternative embodiment.
Previously, a low-power hardware design called the interlocked design was provided to transform NAND Flash memory into a high-performance, low-power multimedia search engine. In its simplest form, it may use 2 NAND Flash cells to represent 1 bit, with a unique pair of probing voltages for testing == “0” (in other words, for testing whether a query information is identical to “0”), and another unique pair of probing voltages for testing == “1” (in other words, for testing whether a query information is identical to “1”). The cell pair conducts if and only if the probing voltage pair matches the represented bit. By concatenating m such cell pairs in a NAND string (a NAND string is a complete serial circuit of NAND Flash cells), an m-bit == test operation can be implemented, by m unique pairs of probing voltages applied to the WordLines (WLs) of the NAND string. Then, a probed NAND string will conduct or draw non-negligible current if and only if its stored data matches the entire m-bit query input. Such an m-bit (or more generally, m-component) query or reference pattern may be referred to herein as a sub-pattern.
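For illustration only, the following behavioural sketch (in Python) restates the matching behaviour described above at the bit level; it is not the hardware itself, and all names and the abstraction level are illustrative rather than taken from the embodiments.

```python
# Purely behavioural sketch of the interlocked design described above: each
# stored bit occupies a cell pair, and a probed NAND string conducts only if
# every cell pair matches the corresponding query bit.
def cell_pair_conducts(stored_bit, query_bit):
    """The pair conducts iff the probing voltage pair matches the stored bit."""
    return stored_bit == query_bit

def nand_string_conducts(stored_sub_pattern, query_sub_pattern):
    """m-bit == test: the serial string conducts only if all m pairs match."""
    return all(cell_pair_conducts(s, q)
               for s, q in zip(stored_sub_pattern, query_sub_pattern))

stored = [1, 0, 1, 1]                              # a 4-bit sub-pattern (8 cells)
print(nand_string_conducts(stored, [1, 0, 1, 1]))  # True  -> string draws current
print(nand_string_conducts(stored, [1, 1, 1, 1]))  # False -> negligible current
```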
Finding the most similar matches to a query vector from a large database of vectors, also known as Nearest Neighbor (NN) search, is a well-known problem in audio, video and other information retrieval, particularly audio/video fingerprinting, which tries to identify a query audio/video clip from a database of reference audio/video content. Exact NN search is challenging when the vectors have high dimensions, where no indexing structure is known to be consistently faster than brute-force search. For approximate NN (ANN), commonly used methods such as Locality Sensitive Hashing (LSH) either become slow due to an excessive number of hard disk seeks, or have to use an excessive amount of main memory for indexing, when the NN distance to the query vector is far and the database is large. According to various embodiments, efficient methods and devices for finding the most similar matches may be provided.
According to various embodiments, the at least one cell 104 may include a plurality of transistors, each of the transistors connected to a corresponding resistance. According to various embodiments, the control circuit 106 may be configured to selectively shortcut at least one of the resistances to which the plurality of transistors correspond based on the query input data.
According to various embodiments, the at least one cell 104 may include a first transistor connected to a first resistance. According to various embodiments, the at least one cell 104 may include a second transistor connected to a second resistance. According to various embodiments, the control circuit 106 may be configured to selectively shortcut the first resistance or the second resistance based on the query input data.
According to various embodiments, a “0” may be stored as a (H L) pair in the first transistor and the second transistor, where L denotes low-resistance state, and H denotes high-resistance state.
According to various embodiments, a “1” may be stored as a (L H) pair in the first transistor and the second transistor, where L denotes low-resistance state, and H denotes high-resistance state.
According to various embodiments, the first resistor may be connected with a first MOSFET in parallel. According to various embodiments, the second resistor may be connected with a second MOSFET in parallel.
According to various embodiments, the first MOSFET is a first nMOSFET. According to various embodiments, the second MOSFET is a second nMOSFET. According to various embodiments, for query input data equal to “0”, a hi voltage may be applied to the first nMOSFET, and a lo voltage may be applied to the second nMOSFET. According to various embodiments, hi may be a voltage high enough to make the first nMOSFET turn ON, and lo may be a voltage low enough to make the second nMOSFET turn OFF.
According to various embodiments, the first MOSFET may be a first nMOSFET. According to various embodiments, the second MOSFET may be a second nMOSFET. According to various embodiments, for query input data equal to “1”, a lo voltage may be applied to the first nMOSFET, and a hi voltage may be applied to the second nMOSFET. According to various embodiments, hi may be a voltage high enough to make the second nMOSFET turn ON, and lo may be a voltage low enough to make the first nMOSFET turn OFF.
According to various embodiments, the memory circuit may include at least one circuit selected from a list of circuits consisting of: a NAND flash architecture; a NOR flash architecture; a 2-transistor source-select NOR flash cell; a SS-CHE split-gate NOR flash cell; and a SuperFlash v1-2 or v3 NOR type cell.
According to various embodiments, the server 112 may further include a hierarchical priority encoder (not shown in
According to various embodiments, the at least one cell may include a plurality of transistors, each of the transistors connected to a corresponding resistance. According to various embodiments, the testing method may further include selectively shortcutting at least one of the resistances to which the plurality of transistors correspond based on the query input data.
According to various embodiments, the at least one cell may include a first transistor connected to a first resistance. According to various embodiments, the at least one cell may further include a second transistor connected to a second resistance. According to various embodiments, the testing method may further include selectively shortcutting the first resistance or the second resistance based on the query input data.
According to various embodiments, a “0” may be stored as a (H L) pair in the first transistor and the second transistor, where L denotes low-resistance state, and H denotes high-resistance state.
According to various embodiments, a “1” may be stored as a (L H) pair in the first transistor and the second transistor, where L denotes low-resistance state, and H denotes high-resistance state.
According to various embodiments, the first resistor may be connected with a first MOSFET in parallel. According to various embodiments, the second resistor may be connected with a second MOSFET in parallel.
According to various embodiments, the first MOSFET may be a first nMOSFET. According to various embodiments, the second MOSFET may be a second nMOSFET. According to various embodiments, for query input data equal to “0”, a hi voltage may be applied to the first nMOSFET, and a lo voltage may be applied to the second nMOSFET. According to various embodiments, hi may be a voltage high enough to make the first nMOSFET turn ON, and lo may be a voltage low enough to make the second nMOSFET turn OFF.
According to various embodiments, the first MOSFET may be a first nMOSFET. According to various embodiments, the second MOSFET may be a second nMOSFET. According to various embodiments, for query input data equal to “1”, a lo voltage may be applied to the first nMOSFET, and a hi voltage may be applied to the second nMOSFET. According to various embodiments, hi may be a voltage high enough to make the second nMOSFET turn ON, and lo may be a voltage low enough to make the first nMOSFET turn OFF.
According to various embodiments, the memory circuit may include at least one circuit selected from a list of circuits consisting of: a NAND flash architecture; a NOR flash architecture; a 2-transistor source-select NOR flash cell; a SS-CHE split-gate NOR flash cell; and a SuperFlash v1-2 NOR type cell.
According to various embodiments, l may be equal to 1. According to various embodiments, the cell pair 132 may include at least one of 1-Tr NOR Flash, 2TS NOR Flash default, 2TS NOR Flash with mid-only voltage to word lines, SuperFlash v1-2, SuperFlash v3, or NGMEM (e.g. RRAM, PCRAM, or MRAM).
According to various embodiments, l may be an integer number larger than 1. According to various embodiments, the cell pair 132 may include at least one of 1-Tr NOR Flash, 2TS NOR Flash default, SuperFlash v1-2, SuperFlash v3, or NGMEM.
According to various embodiments, the hierarchical priority encoder 138 may further include a merging circuit (not shown in
According to various embodiments, the multi-match controller 140 may be configured to report multiple matches by clearing a previously reported match after each report.
According to various embodiments, the multi-match controller 140 may be configured to provide a hierarchically back-traverse mechanism.
According to various embodiments, the multi-match controller 140 may be configured to provide a general column-ID to N decoder.
According to various embodiments, the hierarchical priority encoder 138 may be configured for multi-array operation.
According to various embodiments, the hierarchical priority encoder 138 may be configured for multi-chip operation.
According to various embodiments, l may be equal to 1. According to various embodiments, the cell pair may include or may be at least one of 1-Tr NOR Flash, 2TS NOR Flash default, 2TS NOR Flash with mid-only voltage to word lines, SuperFlash v1-2, SuperFlash v3, or NGMEM.
According to various embodiments, l may be an integer number larger than 1. According to various embodiments, the cell pair may include or may be at least one of 1-Tr NOR Flash, 2TS NOR Flash default, SuperFlash v1-2, SuperFlash v3, or NGMEM.
According to various embodiments, the method may further include controlling a merging circuit to provide hierarchical merging.
According to various embodiments, the multi-match controller may report multiple matches by clearing a previously reported match after each report.
According to various embodiments, the multi-match controller may provide a hierarchically back-traverse mechanism.
According to various embodiments, the multi-match controller may provide a general column-ID to N decoder.
According to various embodiments, the hierarchical priority encoder may provide multi-array operation.
According to various embodiments, the hierarchical priority encoder may provide multi-chip operation.
According to various embodiments, a low-power design using Vpre (instead of Ground) level shielded Bit-line sensing for NAND Flash may be provided.
According to various embodiments, an interlocked design for NAND architecture of NGMEM may be provided.
According to various embodiments, a way of converting 2TS NOR Flash to NAND Flash while not requiring process re-engineering may be provided.
According to various embodiments, scalable Fuzzy search systems may be provided.
NAND Flash cells are floating gate transistors, which have the notion of a threshold voltage Vth (for example as viewed from the Control Gate). If the applied voltage to the cell's Control Gate (i.e., WL) VCG is below Vth, the cell does not conduct, i.e., draws very little current. The cell's current grows (at least substantially; in other words: roughly) exponentially with respect to VCG, until VCG becomes much larger than Vth. By contrast, many of the next-generation memories (NGMEM) such as RRAM (Resistive RAM), PCRAM (Phase-Change RAM), and MRAM (Magnetic RAM), are inherently resistive devices with programmable resistance, as opposed to a transistor with programmable threshold voltage. Although a transistor is often used together with the resistive element in such memories, the transistor serves only as a selector switch and generally has no programmable Vth. Therefore, even if a relatively low input voltage is applied, generally to the bit-line (BL) instead of the WL, a non-negligible current generally may still flow through the resistive element even if it is in a high resistance state (unless the high resistance is very high).
In conventional RRAM, PCRAM, or even MRAM, the cells within each column follow a parallel layout similar to DRAM or NOR Flash. If the cells are instead concatenated to follow a NAND/serial layout (this serial circuit may also be called a NAND string), then we are measuring the sum of resistance across all cells in such a NAND string. Suppose a low resistance state L has resistance RL, and a high resistance state H has resistance RH.
If we want to use the interlocked low-power design, for example by using a (H, L) cell state pair to represent a “0”, and using a (L,H) cell state pair to represent a “1”, then we have difficulty distinguishing between a “0” and “1” if we only observe the BL (bit-line) current (or its corresponding BL voltage). This is because the 2 select transistors in the cell pair both need to be ON to test each cell's resistance state, and yet the total resistance is the same for both represented “0” and “1”: RL+RH (assuming select transistors have equivalent resistance <<RL in the ON state).
According to various embodiments, an interlocked design may be provided, for example for next-generation memories.
In the following, a baseline case of one cell pair according to various embodiments will be described.
To resolve the above-mentioned ambiguity, we can selectively “by-pass” one of the two resistive elements in the cell pair. We can add a “by-pass” transistor in parallel connection to the resistive element in the cell. So for each cell pair there will be 2 “by-pass” transistors. It is to be noted that, to save input pins, we can borrow from the concept of interlocked design, and use 1 nMOSFET and 1 pMOSFET as the 2 “by-pass” transistors, with a common control voltage input referred to as Probe or Query.
This is illustrated in
To test for == “0”, Probe=3V (high voltage) is used. It will turn on T2 and by-pass the top cell's resistive element R1. Yet 3V will turn off the pMOSFET T4 (assume VDD≦3V), so only the bottom cell's resistive element R3 will be measured. Assuming the select and by-pass transistors have much lower resistance than RL, if == “0” is true, then we get NAND string BL current I≈(VDD−VSS)/RL. If == “0” is false, I≈(VDD−VSS)/RH. For RRAM, which can have a fairly high 100:1 resistance ratio or above, this will result in a 100:1 current ratio or above, which may be easy to distinguish. Plus, the non-matching cell pair will draw much less current, similar to the NAND Flash interlocked design where a non-matching cell pair draws almost zero current. The design in
To test for == “1”, Probe =0V (low voltage) is used. It will turn off T2, but will turn on T4 and bypass R3. Therefore, the top cell's resistive element R1 will be measured. If == “1” is true, I≈(VDD−VSS)/RL. If == “1” is false, I≈(VDD−VSS)/RH. Therefore, for both == “0” and == “1” tests, a match corresponds to a large current and no-match corresponds to a small current.
In the following, advanced uses according to various embodiments, for example multi-bit == tests and transistor count minimization, will be described.
Multiple cell pairs may be concatenated in series to support a == test for multiple bits. If all n bits in a pattern match and the NAND string is n pairs long, then BL current I≈(VDD−VSS)/(n*RL); otherwise, I≧(VDD−VSS)/(RH+(n−1)*RL). If cells have a 100:1 resistance ratio, then current differentiation will still be fairly good for n=32.
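For illustration only, the following sketch (in Python, with assumed example values for VDD, RL and RH that merely reflect the 100:1 ratio mentioned above) estimates the current margin between a fully matching string and a string with a single mismatch; the select/by-pass transistor resistance is neglected as in the text.

```python
# Rough current-margin estimate for an n-pair serial string of resistive cells.
VDD, VSS = 3.0, 0.0          # volts (illustrative values)
RL, RH = 10e3, 1e6           # ohms, i.e. a 100:1 resistance ratio (assumed)

def string_current(n_pairs, mismatches):
    """BL current when `mismatches` of the n probed cell pairs do not match."""
    r_total = mismatches * RH + (n_pairs - mismatches) * RL
    return (VDD - VSS) / r_total

n = 32
i_match = string_current(n, 0)       # all 32 bits match
i_miss1 = string_current(n, 1)       # a single mismatching bit
print(i_match / i_miss1)             # ~4x margin even for n = 32
```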
It is to be noted that T2 and T4 in
Furthermore, because T1 and T3 are always fed with 3V (high voltage), they can actually be omitted without causing any trouble. If there are multiple NAND strings per column/BL (often the case), then we only need a T1 (one select transistor) per NAND string to prevent unwanted current from unprobed NAND strings.
In the following, extensions according to various embodiments to the interlocked design will be described, for example illustrating how to allow data initialization and modification.
The new interlocked design in
For example like shown in
In the following, weak-bit representation according to various embodiments will be described.
For media fingerprinting or other applications of nearest neighbor search, the concept of “weak-bits” has been introduced to represent bits that are most likely to have flipped from original to query within a codeword. Typically, to improve the robustness of the search algorithm, those “weak-bits” are ignored during the matching operation. “Weak-bits” can be identified by the fingerprint generation algorithm during database generation (database- or reference-side weak bits) or during query generation (query side weak bits). Pattern matching with weak-bits is supported natively in the NAND Flash interlocked design, with the advantage that no enumeration of weak bits (2^w enumerations for w weak bits) is needed, and the pattern match can be done in just one NAND Flash access cycle.
Weak-bits can be implemented using the interlocked design illustrated in
Therefore, to test for == “0x” in
The resistive elements may support MLC (multi-level cell) by different levels of resistance. This may be used to provide fuzzy pattern matching, although the exact functionality may be different from weak ranges or range quantizers in NAND Flash based interlocked design.
In the following, generalizations for other embodiments will be described.
It is to be noted that:
If the equivalent resistance of the select and/or by-pass transistors is non-negligible, such equivalent resistance can be estimated and incorporated into the calculation of the nominal current value for each test result, e.g., the true or false result for a == test operation. The words “equivalent” and “estimate” are used here because a transistor has a nonlinear relationship between its VCG and current, and thus a changing resistance with respect to its bias conditions. The best estimation of such a transistor's equivalent resistance at the expected bias condition will result in the best estimation of nominal current, and hence of how “distinguishable” various test results are among each other.
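Purely as an illustration of folding such an estimated equivalent resistance into the nominal currents, the following sketch assumes a constant ON-resistance R_ON per transistor; the value used is an assumption, not a figure from the embodiments.

```python
# Sketch of adjusting the nominal match/mismatch currents by an estimated
# transistor equivalent resistance (a bias-dependent estimate, assumed constant here).
VDD, VSS = 3.0, 0.0
RL, RH = 10e3, 1e6
R_ON = 2e3        # assumed equivalent resistance of each select/by-pass transistor

def nominal_current(n_pairs, mismatches, transistors_in_path):
    r_cells = mismatches * RH + (n_pairs - mismatches) * RL
    r_total = r_cells + transistors_in_path * R_ON
    return (VDD - VSS) / r_total

# Nominal "true" and "false" currents for a 1-bit == test with, say,
# two transistors left in the conduction path:
print(nominal_current(1, 0, 2))   # == test true
print(nominal_current(1, 1, 2))   # == test false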
According to various embodiments, a method for performing == test operation using query input data against stored data may be provided,
where stored data are stored in resistive memory devices; and/or
where a “0” is stored as a (H L) pair, and a “1” is stored as a (L H) pair, where L denotes low-resistance state with resistance RL, and H denotes high-resistance state with resistance RH; and/or
where the 2m resistive elements of the 2m resistive memory devices are concatenated in series to form a NAND string; and/or
where each of the 2m resistive elements is connected with a MOSFET in parallel; and/or
where an m-bit == test operation is divided into m 1-bit == test operations, and a 1-bit == test operation involves generating a pair of voltages to the Gate terminals of the two MOSFETs corresponding to the pair of resistive elements being tested; and/or
where, in the case where only nMOSFETs are used for parallel connection to the resistive elements, for == “0”, a (hi, lo) voltage pair is used, and for == “1”, a (lo, hi) voltage pair is used, where hi is a voltage sufficiently high to make the nMOSFET turn ON, and lo is a voltage low enough to make the nMOSFET turn OFF; and/or
where the NAND string is applied a voltage drop of (VDD−VSS) and I is the current flowing through the serial circuit of resistive elements, and the == test operation is declared TRUE if and only if I≈(VDD−VSS)/(m*RL); and/or
where the “0” and “1” representations, the choice of nMOSFET vs. pMOSFET, are swapped according to the “duality” paradigm; and/or
where a (hi, hi) voltage pair is used to implement a query-side don't care bit; and/or
where a (L L) pair is used to implement a reference-side don't care bit.
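For illustration only, the following behavioural sketch (in Python) restates the == test operation enumerated above for the nMOSFET-only case, including the (hi, hi) query-side and (L L) reference-side don't-care bits. The resistance values and the decision margin against the nominal current are assumptions made for the sketch, not part of the embodiments.

```python
# Behavioural sketch of the m-bit == test: "0" stored as (H, L), "1" as (L, H),
# query bit 0 probed with (hi, lo), query bit 1 with (lo, hi), (hi, hi) for a
# query-side don't-care ("X") and (L, L) for a reference-side don't-care ("X").
VDD, VSS, RL, RH = 3.0, 0.0, 10e3, 1e6

STORE = {"0": ("H", "L"), "1": ("L", "H"), "X": ("L", "L")}        # stored pair
PROBE = {"0": ("hi", "lo"), "1": ("lo", "hi"), "X": ("hi", "hi")}  # probing pair

def pair_resistance(stored, query):
    """Resistance contributed by one cell pair: a 'hi' gate by-passes its element."""
    r = 0.0
    for state, gate in zip(STORE[stored], PROBE[query]):
        if gate == "lo":                      # by-pass transistor off -> element measured
            r += RL if state == "L" else RH
    return r

def equals_test(stored_bits, query_bits):
    r_string = sum(pair_resistance(s, q) for s, q in zip(stored_bits, query_bits))
    i = (VDD - VSS) / r_string if r_string > 0 else float("inf")
    i_expected = (VDD - VSS) / (len(stored_bits) * RL)
    return i >= 0.5 * i_expected              # loose reading of "I ≈ (VDD−VSS)/(m·RL)"

print(equals_test("1011", "1011"))   # True
print(equals_test("1011", "1111"))   # False
print(equals_test("1011", "1X11"))   # True: query-side don't care on the second bit
```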
According to various embodiments, various ways of implementing the interlocked design may be provided, augmenting it with essential hardware components, and extending it onto more versatile hardware architectures, in order to create a highly scalable, very low power fuzzy search system.
In the following, adaption of interlocked design to more hardware platforms according to various embodiments will be described.
In the following, adapting NOR flash cells to NAND flash architecture according to various embodiments will be described.
In the following, implementing NAND flash on standard logic CMOS process will be described.
The interlocked design may require modifying NAND Flash, thus requiring semiconductor process support for NAND Flash. However, native NAND Flash process support is not widely available, especially among semiconductor foundries. Therefore, it is desirable to effectively create NAND Flash process support from standard logic CMOS processes. Standard logic CMOS processes generally have at least one polysilicon (also known as poly) layer and support MOSFETs of both n-channel and p-channel type.
Individual Flash cells have been created using standard logic CMOS processes, where the working principle is: (1) degenerate a pMOSFET into a capacitor by shorting its Drain, Source, and Bulk; (2) connect the Gate of the pMOSFET to the Gate of an nMOSFET using the poly layer to form a floating gate (FG); (3) the shorted Drain, Source, and Bulk of the pMOSFET then become the Control Gate (CG) of the newly formed Flash cell. This is illustrated in
Commonly, only individual Flash cell operations or NOR Flash based operations are described. To create NAND Flash out of such cells,
The cell in
if α=Cgp/Cgn<1, NCHE write and PFN erase is used;
if 1<=α<=3, NCHE write and NFN erase is used;
if α>3, NFN write and NFN erase is used.
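Purely as a restatement in code form of the selection rules listed above, where α = Cgp/Cgn is the coupling-capacitance ratio:

```python
# Helper reflecting the write/erase scheme selection rules above.
def write_erase_scheme(alpha):
    if alpha < 1:
        return ("NCHE write", "PFN erase")
    elif alpha <= 3:
        return ("NCHE write", "NFN erase")
    else:
        return ("NFN write", "NFN erase")

print(write_erase_scheme(0.5))   # ('NCHE write', 'PFN erase')
print(write_erase_scheme(2.0))   # ('NCHE write', 'NFN erase')
print(write_erase_scheme(5.0))   # ('NFN write', 'NFN erase')
```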
An NFN erase requires applying a high erase voltage at the Drain and Source of the nMOSFET, but the NAND string configuration in
Therefore, it may be desirable to create a new type of Flash cell that can take advantage of both NFN write and PFN erase in a NAND configuration. This is illustrated in
For read operation, both CGs may use the same (or similar) voltage Vread, then it will have same or similar high coupling ratio as in the NFN write case, except Vread is generally noticeably smaller than Vprog. Also, in read mode the Drain of nMOSFET is set to a low voltage such as Vdd and Source of nMOSFET to Ground/0V. To implement multi-level cells (MLCs), multiple values of Vprog and corresponding Vread may be used. For interlocked based query operation, it is treated as if it were a read operation, except that each word-line may have its unique voltage, whereas in read for NAND Flash only the row being read has a voltage lower than a pass voltage, where the pass voltage is high enough to ensure conductance of the cell irrespective of the cell's state.
Of course, in program and read operations, voltages at CG and CG′ may be different, as long as it achieves the desired FN tunneling effect (for program) or accurate enough readout (for read). For erase operations, voltages at CG′ need not be 0V, as long as it achieves the desired erase effect. The voltages at Drain and Source of nMOSFET may also be adjusted from the nominal values described above, as long as the circuit still achieves the desired functionality. In addition, more than two pMOSFETs may be used for each such Flash cell, and by calculating the capacitive coupling from each pMOSFET to the cell's nMOSFET, a set of voltages for these pMOSFETs' CG in the cell may be determined to achieve the desired FN tunneling effect for program and for erase, using the same principle of high capacitive coupling ratio to Vprog during program, and low capacitive coupling ratio to Verase during erase.
The trade-off of the above CMOS-based NAND Flash implementation includes a larger area per cell, because each pMOSFET in each such cell may require its own n-well, and the minimum spacing between n-wells in order to meet practically any CMOS process' design rule is substantial. This area penalty can be reduced by laying out the cells more efficiently, for example, using the approaches according to various embodiments described next.
Another approach to reducing area overhead is by sharing the n-well across more than two (up to all) cells in a row, where multiple first pMOSFETs (CG) in a row share a horizontal n-well, and multiple second pMOSFETs (CG′) in a row share another horizontal n-well, as illustrated in
Because with this approach the nMOSFETs in the same column but in adjacent rows are now separated by the horizontal n-wells, metal layer wiring will be needed between such nMOSFETs in order to form a NAND string, as shown by the long wires in
In the following, implementing NAND Flash with 2-Transistor Source-Select (2TS) NOR Flash Cells according to various embodiments will be described.
Conventional NOR Flash based on 1-Transistor Flash cells can be re-arranged to a NAND layout to implement NAND Flash, assuming operating voltages can be adjusted accordingly and still fall within the safe ranges supported by the underlying NOR Flash semiconductor process. Some NOR Flash memories are based on a 2-Transistor Source-Select (2TS) Flash cell design, where one MOSFET serving as a select transistor connected to the Source line and one floating-gate transistor serving as the storage element together form a cell. The select transistor is used to deal with the “over-erase” problem in NOR Flash, where an excessive erase may decrease a floating-gate transistor's Vth below the voltage applied to unselected rows (e.g., 0V), and cause unselected cells to drain current from the bit-line and interfere with the read-out of the selected row's cell.
If we assume a CG to FG coupling ratio CR of say 0.65, and a Tox of say 11 nm, the initial FG voltage (if the cell is initially charge-neutral) and initial FN tunneling field can be estimated as stated in Table 1.
In this case, Gate Disturb and Drain Disturb are fairly small, because FN tunneling current reduces exponentially with respect to tunneling field, and a reduction of 4 MV/cm in field (compared to the tunneling field in the to-be-programmed cell) will likely lead to a reduction in tunneling current by 106 to 108 times.
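As an order-of-magnitude check of the statement above, the following sketch uses the standard Fowler-Nordheim form J ∝ E^2·exp(−B/E); the constant B ≈ 240 MV/cm used here is a commonly quoted value for SiO2 tunnel oxide and is an assumption of this sketch, not a figure from the embodiments.

```python
# Order-of-magnitude check of the disturb argument: how much the FN tunneling
# current drops for a 4 MV/cm lower field, starting from ~9.5 MV/cm.
import math

B = 240.0  # MV/cm (assumed typical value for SiO2)

def fn_current(e_field):
    """Relative FN tunneling current density at field e_field (MV/cm)."""
    return e_field ** 2 * math.exp(-B / e_field)

e_program = 9.5            # field in the cell being programmed (MV/cm)
e_disturb = e_program - 4  # a field reduced by 4 MV/cm in a disturbed cell
print(fn_current(e_program) / fn_current(e_disturb))
# ~3e8 with these assumed numbers; the result is very sensitive to B and E,
# and is of the same order as the 10^6 to 10^8 range quoted above.
```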
However, when adapting the above 2TS NOR Flash architecture to NAND, as illustrated in
Especially, to meet requirement (b), the following must hold:
Vpass×CR+VBL_unsel×(1−CR)+ΔVprog≧Vth_fg+VBL_unsel (1)
where ΔVprog is the FG voltage at 0-bias when the cell is programmed (i.e., has an excess of electrons), and Vth_fg is the threshold voltage of the floating-gate transistor when viewed from the point of FG (instead of from the usual viewpoint of Control Gate CG), i.e., how much VFG−VS is needed to make its channel conduct. If we assume ΔVprog=3V and Vth_fg=0.7V, then we get Vpass≧9.7V. When this Vpass is applied to the selected column, it will generate a fairly high tunneling field, causing strong program disturbs, as shown in Table 2 below.
As shown in Table 2, with the above assumed operating values, program disturb on the unselected row in the selected column will be 8.1 MV/cm, too close to the 9.5 MV/cm of the intended cells. Yet the requirement of Vpass≧9.7V is needed to ensure the channel potential in unselected column(s) equalizes to VBL_unsel. If Vpass is reduced, there is either the risk that the channel potential on unselected column(s) fails to reach VBL_unsel, which is needed to suppress program disturbs on unselected columns, or even worse, a lower Vpass may have the effect of self-boosted program inhibit, which will increase the channel potential on unselected columns to much higher than VBL_unsel. Although this will reduce program disturbs, it will raise both channel and drain/source potential, possibly to the point of junction breakdown. If it is required that no semiconductor process change (especially in junction voltage engineering) is needed (e.g. to reduce both NRE time and cost of process engineering), then a lower Vpass cannot be used for chip reliability concerns.
In the following, a way according to various embodiments to solve this problem will be described, as illustrated in
Instead of always using a high Vpass, we first apply a Vpass_hi which meets equation (1), e.g. 10V, and also apply VBL_unsel (or a voltage noticeably higher than VBL_sel) to the selected bit-line, and wait for the channel potentials on unselected column(s) to stabilize to VBL_unsel (or whatever voltage is hereby first applied to the selected bit-line). Then, reduce the voltage(s) on unselected row(s) from Vpass_hi to a Vpass_lo which meets Vpass_lo×CR+VBL_sel×(1−CR)+ΔVprog≧Vth_fg+VBL_sel, e.g. 2V, and also change the selected bit-line's voltage to VBL_sel, and wait for the actual cell programming to take place. By applying VBL_unsel to the selected bit-line, the program disturb field is reduced to only ˜3.4 MV/cm, and after the channel potentials on the unselected column(s) stabilize/equalize to VBL_unsel, then Vpass_lo and VBL_sel are applied, and the program disturb field on the unselected row, selected column would still be kept reasonably low, e.g. in this case to ˜3.5 MV/cm if Vpass_lo=2V. When Vpass reduces from Vpass_hi to Vpass_lo, due to capacitive coupling the channel potentials on unselected column(s) may also decrease, but such a decrease will neither cause an appreciable increase in unwanted tunneling field, because the FG to channel voltage drop will generally decrease due to capacitive coupling, nor lead to any junction breakdown, since the junction voltage drop will only decrease when the channel potential decreases. For word-lines below the selected row, the voltages may be set to a value ≦ Vpass_lo, so that the cells on these word-lines do not get noticeable program disturbs. Note that the voltage values shown in
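For illustration only, the following sketch (in Python) condenses the two-phase sequence described above into code form. The Vpass_hi and Vpass_lo values are the example figures from the text; the `apply` and `wait_for_settle` hooks, the Vprog value, and the VBL levels are placeholders standing in for the actual array drivers, and whether the selected word-line carries its program voltage during the first phase is an assumption of this sketch.

```python
# Condensed, purely illustrative sketch of the two-phase programming sequence.
def program_selected_cell(apply, wait_for_settle, v_prog,
                          v_bl_sel, v_bl_unsel,
                          v_pass_hi=10.0, v_pass_lo=2.0):
    # Phase 1: high pass voltage on unselected rows, selected bit-line held at
    # the inhibit level, so channel potentials on unselected columns can
    # equalize to VBL_unsel without creating a strong program-disturb field.
    apply(selected_row=v_prog, unselected_rows=v_pass_hi,
          selected_bitline=v_bl_unsel, unselected_bitlines=v_bl_unsel)
    wait_for_settle()
    # Phase 2: drop unselected rows to Vpass_lo and pull the selected bit-line
    # to VBL_sel; the actual programming of the selected cell happens here,
    # while disturb fields on unselected rows/columns stay modest.
    # (Word-lines below the selected row may be set to <= Vpass_lo; omitted here.)
    apply(selected_row=v_prog, unselected_rows=v_pass_lo,
          selected_bitline=v_bl_sel, unselected_bitlines=v_bl_unsel)
    wait_for_settle()

# Demo with stub drivers; 16 V and 4 V are placeholders, not figures from the text.
program_selected_cell(apply=lambda **v: print("apply", v),
                      wait_for_settle=lambda: None,
                      v_prog=16.0, v_bl_sel=0.0, v_bl_unsel=4.0)
```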
In the following, implementing NAND Flash with SS-CHE Split-Gate (1.5-Transistor) NOR Flash Cells according to various embodiments will be described.
Another important type of NOR Flash design is the split-gate, also known as the 1.5-Transistor cell design, where half of the cell functions as a select transistor, and the other half as the floating-gate transistor. Such a design generally uses the much more power-efficient Source-Side Channel Hot Electron (SS-CHE) injection (also known as Source-Side Injection or SSI) for cell programming.
In all SS-CHE split-gate NOR Flash cell designs, there is a word-line gate immediately on top of the channel at the Drain side, and a floating gate immediately on top of the channel at the Source side. To program such a cell, a high voltage VS_pgm_NOR is applied at the Source and a VD_pgm_NOR≈0V is applied at the Drain, and the word-line is applied a VWL_pgm_NOR which slightly turns on the channel immediately beneath the word-line gate. VD_pgm_NOR may also be generated by a small current source instead of being a fixed voltage. During read, Vref1, typically Vcc, is applied to the word-line, and Vref2, usually around 1V, is applied to the bit-line, which is the Drain side of the cell. In SuperFlash v3, as illustrated in
In the following, low-power techniques for implementing interlocked design according to various embodiments will be described.
In the interlocked design, a NAND string conducts only if its represented data pattern matches the query data pattern. The presence (or absence) of the NAND string's conductive state can be measured by a sense-amplifier. Any sense-amplifier designed for conventional NAND Flash read operation may be used, since all such sense-amplifiers are designed to test whether a NAND string conducts. For low-power operation, voltage-based sense-amplifiers may be preferable to current-based sense-amplifiers, since no reference current is needed in a voltage-based sense-amplifier, and having a reference current for each column/bit-line may incur non-negligible power overhead. A voltage-based sense-amplifier may work by first pre-charging the bit-line to which the measured NAND string belongs to a pre-defined voltage Vpre (e.g. Vcc), then floating the bit-line from the Vpre input, and then applying corresponding word-line voltages to test NAND string conductivity by checking whether the bit-line's voltage has decreased to below a certain level. If the string is not conductive, the bit-line voltage will still be almost the same as Vpre. If the string is conductive, the bit-line will gradually discharge to ground and its voltage will measurably decrease by the end of the sensing time window. One such voltage-based sense-amplifier uses a double-inverter based latch, where the pre-charging stage forces the latch to an initial state, and if the NAND string conducts and the bit-line discharges, once beyond the trip point of the inverter, the latch will toggle and reach a new bi-stable state. Therefore, the latch state corresponds to the NAND string's conductivity state.
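For illustration only, the following toy model (in Python) of the pre-charge/float/probe sequence described above uses a simple RC discharge; the capacitance, trip point and sensing window are illustrative assumptions, not parameters from the embodiments.

```python
# Toy model of voltage-based sensing: pre-charge the bit-line to Vpre, float it,
# probe the word-lines, and decide "conducting" if the bit-line has discharged
# below a trip point by the end of the sensing window.
import math

def sense_bitline(r_string_ohm, c_bl_farad=2e-12, v_pre=3.0,
                  v_trip=1.5, t_sense=1e-6):
    """Return True if the NAND string is judged conductive (a pattern match)."""
    if math.isinf(r_string_ohm):
        v_end = v_pre                      # non-conductive string: no discharge path
    else:
        v_end = v_pre * math.exp(-t_sense / (r_string_ohm * c_bl_farad))
    return v_end < v_trip

print(sense_bitline(32 * 10e3))            # matching 32-pair string -> True
print(sense_bitline(float("inf")))         # mismatching string (no current) -> False
```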
Due to potentially high parasitic capacitive-coupling interference between adjacent bit-lines in NAND Flash, the Shielded Bit-line sensing method may be used to suppress such interference, by pre-charging and then sensing the even bit-lines while simultaneously grounding all odd bit-lines, followed by pre-charging and then sensing the odd bit-lines while simultaneously grounding all even bit-lines (or vice versa). As illustrated in
If Shielded Bit-line sensing has to be used instead of ABL architecture, the shielding scheme can be modified from ground-shielding to pre-charge level shielding to make it low-power. That is, when pre-charging and sensing the even bit-lines, the odd bit-lines are also pre-charged to the same pre-charge voltage Vpre; but during sensing the odd bit-lines will be held at the Vpre input, instead of being floated from the Vpre input and tested for any discharge as in the even bit-lines. Assuming that most bit-lines don't match the query input, then only very few odd bit-lines will draw current during sensing. Then, when pre-charging and sensing the odd bit-lines, the even bit-lines are also pre-charged to Vpre, but will be held at the Vpre input, instead of being floated from the Vpre input and tested for any discharge as in the odd bit-lines. This is illustrated in
In the following, adapting interlocked design to NOR Flash architecture will be described.
Although the interlocked design may be based on NAND Flash, in the following, a method of adapting it to NOR Flash architecture according to various embodiments will be described. Whereas in the NAND version only a matching NAND string's bit-line conducts and draws current, with the NOR adaptation only a mismatching column's bit-line will conduct and draw current, and consequently only a matching column's bit-line will not draw current.
In the following, a 1-bit Case (and extension to Next-Generation Memories) according to various embodiments will be described.
Read and query sensing can be done by either voltage, or current. If by voltage, generally the bit-line is pre-charged to a given level Vpre (typically Vcc or Vdd), then the bit-line is floated from Vpre, and the word-lines are probed with corresponding voltages, and the sense amplifier tests for presence of discharge on the bit-line to determine presence of current flow, same as explained above. Alternatively, current based sense amplifiers, such as described above may be used.
Although
Because each bit-line of a typical NOR Flash cell array may have many cells attached, for cell pair(s) not participating in a particular pattern match, their corresponding word-lines should be applied low enough voltage(s) (e.g. lo) to guarantee non-conductivity in the cell channel irrespective of the cell state, so that they don't contribute bit-line current spuriously. For example, if there are 32 cell pair(s), i.e. 64 cells attached on a bit-line, and the query pattern corresponds to only the top 16 bits, then the bottom 16 cell pairs' word-lines can all be applied lo. In addition, for 2TS NOR Flash (e.g.
By treating SuperFlash v1-2 cells as if they are 2TS NOR Flash cells like in
Weak bits, also known as don't care bits, can also be implemented in the NOR adaptation of interlocked design. A (programmed, programmed) cell pair may be used to implement a reference-side weak bit, because both (lo, mid) and (mid, lo) will not be able to make either of the two cells conduct, thus designating a matched query bit. Although not allowed in
In addition to adapting the interlocked design to NOR Flash architecture, it can also be adapted to next-generation memories (NGMEM), such as PCRAM (Phase Change), RRAM (Resistive), and MRAM (Magnetic). The basic characteristic of NGMEM is a programmable resistor connected in series to a select transistor, where the resistance state (low resistance vs. high resistance) may be changed by applying certain signals (e.g. voltages or for MRAM a current with a certain electron spin) on the bit-line. As illustrated in
In addition, a (RH, RH) cell pair may be used to implement a reference-side weak-bit, because it will draw a small current of VBL/RH per cell pair, irrespective of input (lo, mid) or (mid, lo). Similarly, a (lo, lo) may be used to implement a query-side weak-bit, because it will always draw no current. However, this “no current” is, more accurately speaking, the cell leakage current when (lo, lo) is applied, and is almost zero, which makes it slightly different from VBL/RH (the match current for 1 cell pair without a query-side weak-bit), especially when RH is not very large; therefore the sense amplifier may need to take into account the existence of query-side weak-bits to use a proper reference current level for sensing.
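For illustration only, the following sketch (in Python) sums the bit-line current per the contributions described above. The assumption that a normal mismatching cell pair contributes approximately VBL/RL is our reading of the NGMEM NOR adaptation (the large mismatch current), and the RL, RH and VBL values are illustrative.

```python
# Rough bit-line current sum for the NGMEM (NOR-style) adaptation: a matching
# pair contributes ~VBL/RH, a mismatching pair ~VBL/RL (our reading), a
# reference-side weak bit (RH, RH) always ~VBL/RH, and a query-side weak bit
# (lo, lo) contributes ~0.
VBL, RL, RH = 1.0, 10e3, 1e6

def pair_current(stored, query):
    if query == "X":                  # query-side weak bit: both inputs lo
        return 0.0
    if stored == "X":                 # reference-side weak bit: (RH, RH) pair
        return VBL / RH
    return VBL / RH if stored == query else VBL / RL

def bitline_current(stored_bits, query_bits):
    return sum(pair_current(s, q) for s, q in zip(stored_bits, query_bits))

print(bitline_current("10X1", "1011"))   # all effectively matched -> small current
print(bitline_current("10X1", "1111"))   # one real mismatch -> dominated by VBL/RL
```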
In the following, a multi-bit and range query case according to various embodiments will be described.
To extend the interlocked design of NOR Flash to multi-level cells (MLCs), for convenience of description, we use the opposite encoding convention to
Then, it can be proven that the above scheme implements the multi-bit exact match for NOR Flash, including for l=1. More generally, if the reference cell pair is (a, 2^l−b−1), and the query pair is (x, 2^l−y−1), then it is testing for the expression x≦a && y≦b. This may be used to implement complex search functionalities such as range query, similar to the range query in the NAND Flash based interlocked design, but with different mappings, because the direction of the inequality operators for x vs. a, and y vs. b may be opposite compared to those commonly used. The mappings for NOR Flash are illustrated in
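For illustration only, the following logic-level check (in Python) verifies how an exact match can be obtained from the pair primitive stated above. The particular instantiation a = v, b = 2^l−1−v on the reference side and x = q, y = 2^l−1−q on the query side is our reading of how the == test falls out of the primitive; it is a check of the algebra only, not a device-level model.

```python
# Check: with the reference cell pair written as (a, 2^l - b - 1) and the query
# pair as (x, 2^l - y - 1), the pair tests "x <= a and y <= b"; the exact-match
# instantiation below reduces this to q == v.
l = 3
K = 2 ** l                                   # number of cell states

def pair_test(a, b, x, y):
    return x <= a and y <= b

for v in range(K):                           # reference value
    for q in range(K):                       # query value
        a, b = v, K - 1 - v
        x, y = q, K - 1 - q
        # reference pair actually stored: (a, K - b - 1) == (v, v)
        # query pair actually applied:    (x, K - y - 1) == (q, q)
        assert pair_test(a, b, x, y) == (q == v)
print("exact match recovered from the range primitive for all", K, "states")
```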
Also, instead of an l-bit cell, more generally a k-state cell may also be used, simply by replacing 2^l with k in the interlocked notation for an l-bit cell, including various forms of range query in
Again, for cell pairs not participating in the pattern match, their corresponding word-lines should be applied a low enough voltage, e.g. f(0), such that none of these cells can conduct irrespective of their cell states.
Although a monotonically increasing f(i) is used in this section, a monotonically decreasing f(i) may also be used provided the cell state definition is reversed such that state 0 is the most programmed and state 2^l−1 is erased. Also, instead of n-channel Flash cells which are the default here, p-channel Flash cells may also be used. P-channel Flash cells implement a <= logic instead of n-channel's >= logic. The conversion of this section's NOR Flash interlocked design to p-channel Flash can be done following the same procedures for porting the NAND Flash interlocked design to p-channel Flash, and should be familiar to those skilled in the art of p-channel Flash. Similarly, the notation convention of what encodes/represents a “0” vs. “1”, and what probing voltages correspond to a query test of “== 0” vs. “== 1”, may be swapped for
With the NOR adaptation of the interlocked design, most columns would have current flow because most columns will likely be mismatched, and this could lead to significantly higher power consumption compared to the NAND version of the interlocked design. To curb power consumption, one may use type(s) of sense-amplifier(s) with early mismatch detection, i.e., detecting a mismatched column (which would have a relatively high mismatch current) early on in the sensing cycle and then immediately cutting off current flow to such a column.
In the following, interlocked design without double storage requirement according to various embodiments will be described.
The interlocked design and its extension to NOR-Flash architecture described above all use two l-bit (or more generally k-state) cells to represent an l-bit (or more generally k-state) value or range. According to various embodiments, a method of using only one cell instead of two cells may be provided to achieve the same functionality of == test without actually reading the cells. That is, if the == test is false, the accessing circuit does not necessarily know what value is stored in those cells. This “not necessarily know” characteristic is similar to the interlocked design and its extension to NOR-Flash as described above.
In the following, a NOR flash case according to various embodiments will be described.
The f(i−1) pulse of WL1 will drain the bit-line's pre-charged level from Vcc−Vtn to 0V, if cell 1 state (denoted S1) is i−1 or smaller, because the cell would have conducted. Because C2 is still held high, the draining/discharging of the bit-line will also cause the parasitic capacitor at Gate of T4 to discharge, also from Vcc−Vtn to 0V. This implies T4 will not turn on afterwards (until the next read/query cycle). Note while C2 is held high, the pMOSFET T3 will remain off.
After the f(i−1) pulse of WL1 and any potential discharging of the bit-line and T4G is complete, C2 is then held low (which would turn on T3), and WL1 is applied a voltage of f(i), so if the cell state S1>i, the cell will not conduct. If S1=i the cell will conduct, and since C2 is now low implying T2 is now off, VT4G will remain at Vcc−Vtn instead of discharging to 0V, keeping T4 on. Then the conducting current I3 is compared against a reference current Iref by a current-based sense amplifier, which can then report a logic output of whether I3>Iref. Because I3 requires a voltage source, an implicit Vcc may be contained inside the sense-amplifier, as illustrated by the dashed line
The method in
Similar to the range query for one cell, multiple cells can also be probed with a query-side range query. To test whether a cell j (on row j)'s state Sj ∈ [xj,yj], its corresponding WLj can be first applied f(xj−1), then applied f(yj) and tested for presence of current. Compared to the more strict == qj test, a range query is not only more relaxed in matching constraint, but also may generate more diverse (i.e. more widely distributed) levels of matching current. This is because for any == i test, if f(i)=(Vth(i)+Vth(i+1))/2, then the word-line voltage is exactly ΔV/2 higher than Vth(i), where ΔV=Vth(i+1)−Vth(i), and ΔV is typically the same or similar for all i's. This implies that the conducting/matching current will be similar across all i's, e.g. I0. Whereas in a range query, during the 2nd WL pulse, if matched, WLj−Vth(Sj) may be much higher than ΔV/2, and the matching current may be much higher than I0. Or, the matching current may be just I0. Then, the total bit-line current where all m cells match may span from m·I0 to much higher, and where m−1 cells match may span from (m−1)·I0 to much higher, and note the two current ranges will generally overlap. Therefore, it may become challenging to accurately determine whether all m cells matched.
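For illustration only, the following behavioural sketch (in Python) restates the two-pulse probe described above at the logic level; the mapping f(·) and all circuit details are abstracted away, and a cell in state S is simply assumed to conduct under a pulse f(t) iff S ≦ t.

```python
# Behavioural model of the two-pulse probe (single cell, NOR case): the first
# word-line pulse at f(x-1) "kills" the helper node if the cell state is below
# the range, and the second pulse at f(y) only draws current if the cell state
# is within the range and the helper node survived.
def cell_conducts(state, pulse_level):
    return state <= pulse_level

def range_probe(state, x, y):
    """True iff the probed cell reports state in [x, y] (the == test is x == y)."""
    helper_on = not cell_conducts(state, x - 1)      # 1st pulse: discharge kills T4's gate
    second_pulse_conducts = cell_conducts(state, y)  # 2nd pulse at f(y)
    return helper_on and second_pulse_conducts       # current I3 > Iref iff both hold

# == i test is the special case x == y == i:
for s in range(8):
    assert range_probe(s, 5, 5) == (s == 5)
    assert range_probe(s, 2, 5) == (2 <= s <= 5)
print("single-cell == and range probes behave as described")
```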
In the following, a NAND flash case according to various embodiments will be described.
For NAND Flash, the 1st WL pulse has to be applied to each word-line without overlapping in time. In addition, when applying the 1st WL pulse of voltage f(xj−1) on row j, all other word-lines must be supplied a hi voltage, where hi must ensure cell conductance irrespective of cell state. If the bit-line did not discharge after testing all probed cells in the 1st WL pulse, then it can be concluded that Sj>=xj. Then, when applying the 2nd WL pulse of voltage f(yj) on row j, all probing word-lines can be applied simultaneously instead of sequentially. Then, if the bit-line conducts, it can be concluded that Sj<=yj, hence Sj ∈ [xj,yj]. The disadvantage of this method for NAND Flash is the long delay: a random access cycle is required for each probing word-line during the 1st WL pulse.
In the following, a memory architecture suitable for writing data in column-wise manner according to various embodiments will be described.
In applications where the fuzzy search database does not change frequently, conventional write operations, e.g., writing in a page-wise manner where a page is generally a row of memory cells, may be used. However, in cases where the database needs to change or update frequently, especially if the reference data patterns become available in a real-time streaming fashion, it may be more time-efficient to write data in a column-wise manner, because waiting for reference data patterns to accumulate to the point of filling the whole memory array may incur undesirable latency. Next we show how to adapt NOR, NAND and next-generation memory architectures, so that reference data patterns can be written to the array in a natively column-wise manner. In addition, such adaptation may also support column-wise erase or reset operations natively, so that the database may be updated in-place incrementally, without having to erase an entire block before updating (a limitation usually found in NAND and NOR Flash memories).
In the following, adaption for SuperFlash v1-2 NOR Type according to various embodiments will be described.
In conventional SuperFlash v1 and v2, in the cell array Source diffusions in the same row are typically extended and merged together to form a Source line, and only up to 1 row of cells are programmed at a time, with the selected row's Source line applied 8-10V and other Source lines applied 0V, as illustrated in
In the adapted architecture, in the cell array a Source line is merged from Source diffusions in the same column, and each word-line may be applied a non-0V voltage for programming, and the column selected for programming is applied a bit-line voltage of ˜0V, and other bit-lines are applied Vcc to inhibit programming on unselected columns. This is illustrated in
If each Source line can be independently controlled, then conventional SuperFlash would allow page-wise (row-wise) erase, as opposed to having to erase by the whole block. When adapted to the simultaneous column-wise programming method illustrated in
It is to be noted that it is also possible to connect all Source lines (whether they are horizontal or vertical lines) in an array together all the time, and the scheme in
The merging of Sources into the Source line may be realized by diffusion extensions, as illustrated in
In the following, a highly scalable and hierarchical priority encoder for reporting matches according to various embodiments will be described.
In the following, a hierarchical design and efficient logic implementation according to various embodiments will be described.
Both the original (one projection compared at a time) and enhanced (multiple projections compared at a time) vote count algorithms described above may increment a vote counter ci for each column i upon each sub-pattern match (whether such a sub-pattern corresponds to a single projection/dimension or multiple projections/dimensions). The columns whose vote counters meet or exceed a specified threshold T (i.e. ci>=T) are then considered candidate matches and their column IDs (i.e. index numbers) should then be reported using a priority encoder. Such a priority encoder has N inputs, with a 1 indicating a candidate, 0 otherwise, and it should report whether there is any candidate, and if so, the column IDs of all or part of the candidates. Because the vote count algorithm is intended for large databases, the number of columns N may be very large, making conventional priority encoder (PE) design inefficient. Also, most conventional PEs can only report 1 candidate match.
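For illustration only, the following minimal sketch (in Python) of the vote-count stage feeding the priority encoder uses illustrative names; it mirrors the counting and thresholding described above, not any particular hardware implementation.

```python
# Minimal sketch: each sub-pattern match increments a per-column counter, and
# columns whose count reaches the threshold T become the encoder's 1-inputs.
def candidate_columns(sub_pattern_match_lists, n_columns, threshold):
    """sub_pattern_match_lists: for each probed sub-pattern, the list of
    column IDs whose stored sub-pattern matched the query sub-pattern."""
    votes = [0] * n_columns
    for matched_cols in sub_pattern_match_lists:
        for col in matched_cols:
            votes[col] += 1
    # Encoder input: 1 where the vote count meets/exceeds T, else 0.
    return [1 if v >= threshold else 0 for v in votes]

matches = [[0, 3, 5], [3, 5, 7], [3, 6]]           # three sub-pattern probes (example)
print(candidate_columns(matches, 8, threshold=2))  # -> [0, 0, 0, 1, 0, 1, 0, 0]
```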
According to various embodiments, a hierarchical priority encoder may be provided, which has a highly scalable design. According to various embodiments, tie-breaking decision may be made in a hierarchical instead of global manner. This is shown in
Pj,i = Pj−1,2i | Pj−1,2i+1 (2)
where Pj,i is the i-th value of the hierarchical priority encoder at j-th layer, and i starts from 0 at each layer, and “|” is the logical OR operator. Equation (2) above also applies to “right side wins” criterion which is illustrated in.
At the lowest, i.e. root layer (j=log2N+1) (note we assume N is a power of 2, and if not, the remaining columns may be padded with input of 0 to make it a power of 2) it will be known whether there is at least one match. Then, the column ID of this match (if there is one), can also be determined hierarchically (for both left-side and right-side wins criterion) as shown in Table 3.
It is to be noted that Equation (4) in Table 3 effectively uses a 2:1 mux, and such a mux can be implemented using logic gates, as illustrated in
After a winner candidate column is reported, it should be cleared (e.g. by clearing its corresponding input at the j=0 layer) so that the priority encoder can report the next winner candidate. One embodiment of implementing this is by having a decoder circuit whose input is the just-reported column ID and whose outputs are N logic signals with only the signal corresponding to the just-reported column ID being 1 and the rest being 0, and these signals can then be used to control the clearing of the input at the j=0 layer. To efficiently clear the input at the j=0 layer (instead of having a general decoder which may add additional circuitry overhead), we also present a hierarchical reverse traversal mechanism (for both left-side and right-side wins criterion), as shown in Table 4.
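For illustration only, the following software sketch (in Python) mirrors the structure described above for the "left side wins" criterion: the OR tree of Equation (2), hierarchical resolution of the winning column ID, and clearing of the reported input so the next candidate can be reported. It reflects the algorithmic structure only, not the gate-level implementation of Tables 3 and 4.

```python
# Software sketch of the hierarchical priority encoder with multi-match reporting.
def report_all_matches(inputs):
    """inputs: list of 0/1 of length N (a power of two). Yields column IDs
    in 'left side wins' priority order, clearing each winner after report."""
    bits = list(inputs)
    while True:
        # Build the OR tree: P[j][i] = P[j-1][2i] | P[j-1][2i+1]
        tree = [bits]
        while len(tree[-1]) > 1:
            prev = tree[-1]
            tree.append([prev[2 * i] | prev[2 * i + 1] for i in range(len(prev) // 2)])
        if tree[-1][0] == 0:          # root is 0: no candidate left
            return
        # Walk back down, preferring the left child at every tie (left side wins).
        idx = 0
        for layer in reversed(tree[:-1]):
            idx = 2 * idx if layer[2 * idx] else 2 * idx + 1
        yield idx
        bits[idx] = 0                 # clear the reported winner before the next pass

print(list(report_all_matches([0, 1, 0, 0, 1, 0, 1, 0])))   # -> [1, 4, 6]
```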
The sub-expressions & SELj,i in Equations (7a), (7b), (8a), and (8b) in Table 5 are important for properly implementing hierarchical reverse traversal as illustrated in
As illustrated in
To allow reset of all column inputs at level j=0 at the beginning of priority encoding, SEL0,i as illustrated in
In addition to binary branches with a hierarchical tie-breaking criterion, which has been described above, m-ary branches, where m inputs at level j are merged into 1 intermediate/final output with a hierarchical tie-breaking criterion, may be used. The formulas for deriving the output decision, column ID (identifier), and clearing after report, can all be derived following the working principles described for the binary case, and should be familiar to those skilled in the art of digital design in view of the examples above.
In the following, interoperation among priority encoders (Inter-SubArray and Inter-Chip) according to various embodiments will be described.
In the following, Inter-SubArray will be described.
When a memory chip supporting vote count contains multiple sub-arrays (where a sub-array is defined as the smallest memory cell array that can be operated upon with read and write operations), the queries can be carried out either for a specific sub-array or for the entire chip. Each sub-array may have its own set of vote counters and priority encoder, and the priority encoders for the sub-arrays (each also referred to as a stage-1 priority encoder) may then be merged together, hierarchically, into a large-scale priority encoder for the whole chip (the whole encoder minus the stage-1 encoders is also referred to as a stage-2 priority encoder). This is illustrated in
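For illustration only, the merging of stage-1 (per-sub-array) priority encoders into a stage-2 (chip-level) priority encoder may be sketched behaviorally as follows; the stage-2 tree is flattened into a simple left-to-right scan here, and all names are illustrative.

```python
# Illustrative sketch of merging per-sub-array (stage-1) results into a chip-level
# (stage-2) decision; the stage-2 tree is flattened into a left-to-right scan here.

def chip_level_winner(per_subarray_inputs):
    """Return (sub-array index, column index) of the chip-level winner, left side wins."""
    stage1_roots = [int(any(cols)) for cols in per_subarray_inputs]   # one root per stage-1 encoder
    for sa, has_candidate in enumerate(stage1_roots):                 # stage-2 selection
        if has_candidate:
            col = next(c for c, v in enumerate(per_subarray_inputs[sa]) if v)
            return sa, col
    return None

print(chip_level_winner([[0, 0, 0, 0], [0, 1, 0, 1]]))   # -> (1, 1)
```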
When merging, SELj,i and C*j,i at the root (i.e. bottom layer) of the stage-1 priority encoder are wired to the stage-2 priority encoder via the light-blue data bus shown in
The method of having stage-1 and stage-2 priority encoders, as illustrated in
The major concept is to have a simple control logic signal "mode" that lets the chip work at the sub-array level or at the chip level. Suppose there are N blocks in total on the chip; because different blocks share a common set of columns, only one block can be activated at a given moment during the query process. We use BEi (i ∈ {1, . . . , N}) to denote the block enabling signals ('0' active) generated from the on-chip controller. We use SAi,1 (i ∈ {1, . . . , N}) to denote the 1st sub-array in block i, as shown in
The difference between the sub-array level and the chip level is that the former requires the priority encoder (and the vote counters) to work for each SAi,1 (i ∈ {1, . . . , N}) and report the matched column IDs in the respective sub-arrays separately, while the latter requires the priority encoder to wait until all the SAi,1 (i ∈ {1, . . . , N}) have been activated (i.e., their sub-pattern matching and vote counting done) and then report the matched column IDs. The BEi (i ∈ {1, . . . , N}) signal sequences are the same in both modes. There are two tasks for the control logic signal "mode": one is to control the timing at which PE′ (the enabling signal for the collective sequence of vote counting, threshold comparison and priority encoding) is activated; the other is to have the matched column IDs include the location information of each sub-array when working at the sub-array level (a behavioral sketch of the two modes is given after the list below). These are achieved by the on-chip logics as shown in
1) Sub-array level (mode=0)
2) Chip level (mode=1)
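For illustration only, and under the assumption (not stated circuitry) that at chip level the shared vote counters simply accumulate across sub-arrays before a single report, the two modes may be sketched as follows; the actual on-chip logic is defined by the circuitry referred to above, and all names here are illustrative.

```python
# Illustrative sketch of the two modes; the chip-level accumulation across sub-arrays is an
# assumption made for this sketch only.
# votes_per_subarray[i][c] = votes gathered for column c while sub-array i is active.

def run_query(votes_per_subarray, threshold, mode):
    if mode == 0:
        # sub-array level: priority encoding runs for each sub-array separately and the
        # reported IDs carry the sub-array location
        return [(i, c) for i, votes in enumerate(votes_per_subarray)
                for c, v in enumerate(votes) if v >= threshold]
    # chip level: wait until all sub-arrays have been activated, then report once
    totals = [sum(col) for col in zip(*votes_per_subarray)]
    return [c for c, v in enumerate(totals) if v >= threshold]

votes = [[2, 0, 1], [0, 0, 2], [1, 0, 1]]
print(run_query(votes, threshold=2, mode=0))   # -> [(0, 0), (1, 2)]
print(run_query(votes, threshold=3, mode=1))   # -> [0, 2]
```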
In the following, Inter-Chip according to various embodiments will be described.
When a query is performed among multiple memory chips supporting the original or enhanced vote count (each generally referred to as a VC-chip), it is expected that the input of the query string to the VC-chips and the output of the matched column IDs should be the same as those for a single VC-chip. According to various embodiments, a highly scalable serialized design, for example as shown in
According to various embodiments, the following signals may be defined:
PE—Priority encoder enabling signal which is ‘0’ active, i.e., the priority encoder will only start to work when PE=‘0’. Note that PE is also the serialized input signal of the VC-chip.
PO′—Priority encoder output indicating signal which is '1' active, i.e., there is at least one matched column ID only when PO′='1'.
PO—The serialized output signal of the VC-chip.
PD′—A sequenced output of matched column IDs from the priority encoder.
PD—The tri-state output which can be connected to the output channel.
The Input Channel and Output Channel in this design refer to the shared data bus among all the VC-chips, which could be a number of PCIe lanes, a number of AMBA AXI channels, etc. It will be understood that, according to various embodiments, various different specific data bus standards may be used.
The on-chip logic for the above-defined signals may be:
with the initial condition PE1=0. The symbol "∩" denotes the logical OR operator.
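For illustration only, the serialized chain behavior described by the above signals may be sketched as follows; the sketch mirrors only the described behavior (chip i+1 starts as soon as chip i is done) and does not reproduce the actual on-chip logic equations. Names are illustrative.

```python
# Illustrative behavioral sketch of the serialized VC-chip chain: asserting PE1 = '0'
# starts the process, and each chip's PO enables the next chip's PE as soon as the chip
# is done, whether or not it had anything to report (names are hypothetical).

def run_chain(per_chip_candidates):
    """per_chip_candidates[i] lists the matched column IDs held by VC-chip i."""
    output_channel = []
    for chip_id, candidates in enumerate(per_chip_candidates):
        # PO' = '1' when this chip has at least one matched column ID; its PD' sequence is
        # then driven onto the shared output channel through the tri-state PD output.
        for col in candidates:
            output_channel.append((chip_id, col))
        # The chip's PO then enables the next chip's PE immediately, so no cycle is wasted.
    # PO of the last chip indicates the end of the aggregated output process.
    return output_channel

print(run_chain([[3, 7], [], [1]]))   # -> [(0, 3), (0, 7), (2, 1)]
```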
There may be several advantages according to various embodiments:
1) Simplicity—The entire query output process (which can also be referred to as the "aggregated priority encoder output") is started by asserting PE1 to '0', and the end of the process is indicated by PON, where N is the total number of VC-chips.
2) High efficiency—Not a single cycle is wasted between the outputs from any two consecutive VC-chips. In case chip i (i ∈ {1, . . . , N}) has no matched column ID to output, the priority encoder of chip i+1 will be started immediately.
3) Scalability and flexibility—As far as the first and the last VC-chips are concerned, there could be any number of VC-chips in between. Any VC-chip can be removed from the chain by simply short-circuiting its PE and PO pins. Similarly, adding one VC-chip into the chain is also straightforward.
In the following, design optimizations for IC layout and heat dissipation considerations according to various embodiments will be described.
In the vote count algorithm without the interlocked design, activating all sub-arrays simultaneously (for matching against a query sub-pattern) may use too much power. To address this high power consumption and the resulting heat dissipation issue, it can be arranged that only some sub-arrays are activated at a time. For example, all sub-arrays on the same horizontal level may be activated simultaneously, while the other levels are not activated. Then, in the next access cycle, all sub-arrays on the next horizontal level are activated simultaneously, and so on.
In addition, such a mode of operation allows saving transistors for the priority encoder and vote counters, by sharing such circuits across the various horizontal levels. For example, in contrast to
Also, if a VC chip with no priority encoder or vote counter sharing is designed to report, say, the first 8 candidates, and there are 4 sub-arrays in the same vertical direction, then with priority encoder and vote counter sharing we may ask the VC chip to report the first 2 candidates per horizontal level, so that after processing all 4 horizontal levels the chip will report at most 8 candidates. However, the exact list of reported candidates may differ between the sharing and non-sharing cases, even when the database is the same and the same query pattern is used in both cases, because sharing the priority encoder also changes the output priority. For some applications, this discrepancy may not be a real issue.
DRAM, which can be used for implementing the vote count algorithm, generally shares a sense amplifier between two adjacent bit-lines, either from two adjacent sub-arrays (in the Open array architecture) or from two adjacent columns in the same sub-array (in the Folded array architecture). Only one of these two bit-lines may be sensed at a time, because the other bit-line is used to provide a reference voltage to the sense amplifier. This is similar in spirit to NAND Flash's Shielded bit-line sensing scheme described above; therefore, for all such bit-line pairs, we also refer to them as even and odd bit-lines, respectively.
In the presence of such sense-amplifier sharing, if transistor saving is preferred, the vote counters and priority encoder may also be shared by the even and odd bit-lines. In that case, similarly to the sharing of vote counters and priority encoder described above, the VC chip would need to perform the entire vote counting, priority encoding and reporting procedure for the even bit-lines before performing the same procedure for the odd bit-lines (or vice versa). Likewise, the priority encoder's reported candidates could differ from those in the case with no priority encoder or vote counter sharing. When no priority encoder or vote counters are shared in a DRAM-based vote count implementation, a 1:2 demux may be needed to route the shared sense amplifier's output to the vote counter circuit corresponding to either the even or the odd bit-line.
Because NAND Flash's Shielded bit-line sensing scheme, as described above, typically shares the sense amplifier between two adjacent bit-lines, sometimes even with two additional such bit-lines from an adjacent sub-array, it is quite similar in spirit to DRAM's shared sense amplifier. Therefore, in such a case the vote counters and priority encoder may also be shared by those bit-lines sharing the sense amplifier, just as in the DRAM case, and the chip would need to perform the entire vote counting, priority encoding and reporting procedure for the even bit-lines before performing the same procedure for the odd bit-lines (or vice versa). If the sense amplifier is shared by another two bit-lines from an adjacent sub-array, then the entire vote counting, priority encoding and reporting procedure has to be performed for the even and then the odd bit-lines in one sub-array (or vice versa), before the same steps can be applied to the even and then the odd bit-lines in the other, adjacent sub-array.
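For illustration only, the processing order imposed by a shared sense amplifier may be sketched as follows; the sub-array and bit-line labels are illustrative.

```python
# Illustrative sketch of the ordering imposed by a shared sense amplifier: the full
# vote counting / priority encoding / reporting procedure runs per bit-line half, and
# one sub-array finishes both halves before an adjacent sub-array sharing the amplifier starts.

def processing_order(shared_with_adjacent_subarray):
    subarrays = ["sub-array A", "sub-array B"] if shared_with_adjacent_subarray else ["sub-array A"]
    order = []
    for sa in subarrays:
        for half in ("even bit-lines", "odd bit-lines"):
            order.append((sa, half))   # one full vote-count + encode + report pass per entry
    return order

print(processing_order(shared_with_adjacent_subarray=True))
# -> [('sub-array A', 'even bit-lines'), ('sub-array A', 'odd bit-lines'),
#     ('sub-array B', 'even bit-lines'), ('sub-array B', 'odd bit-lines')]
```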
While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
Number | Date | Country | Kind |
---|---|---|---|
10201400292T | Feb 2014 | SG | national |
10201400303Y | Feb 2014 | SG | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SG2015/000065 | 3/2/2015 | WO | 00 |