Implementations of this disclosure relate to the field of memory management technology, and particularly to a device and method for shared memory processing and a non-transitory computer storage medium.
Modern wireless mobile communication systems support great bandwidth and multiple carriers, and have various carrier processing capacities. Therefore, a system for signal processing is required not only to have high processing capacity, but also to flexibly and rapidly adapt to various capacity levels. However, on one hand, the current system for signal processing has limited processing capacity; on the other hand, there may be access conflicts when multiple processing units access a shared memory, which reduces processing efficiency.
Implementations of the disclosure provide a device and method for shared memory processing and a non-transitory computer storage medium.
In a first aspect, a device for shared memory processing is provided in implementations of the disclosure. The device for shared memory processing includes a set of shared memory units, a set of processing units, and a set of global clock synchronizers. Each shared memory unit corresponds to one global clock synchronizer and is coupled with K processing units via the corresponding global clock synchronizer, and the coupled K processing units perform conflict-free memory access to the shared memory unit during one instruction cycle of the corresponding global clock synchronizer. One instruction cycle of each global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are integers greater than zero.
In a second aspect, a method for shared memory processing is provided in implementations of the disclosure and applicable to a device for shared memory processing. The device for shared memory processing includes a set of shared memory units, a set of processing units, and a set of global clock synchronizers. Each shared memory unit corresponds to one global clock synchronizer and is coupled with K processing units via the corresponding global clock synchronizer. The method for shared memory processing includes acquiring a status signal of each of the K processing units when the coupled K processing units respectively transmit access requests to the corresponding shared memory unit; determining a count value of a global counter in the corresponding global clock synchronizer; determining a to-be-responded processing unit in a current clock, according to the status signal and the count value determined; and accessing the shared memory unit in the current clock, according to the to-be-responded processing unit determined. One instruction cycle of each global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are positive integers.
In a third aspect, a non-transitory computer storage medium storing a computer program is provided in implementations of the disclosure. The computer program is executed by a device for shared memory processing. The device for shared memory processing includes a set of shared memory units, a set of processing units, and a set of global clock synchronizers. Each shared memory unit corresponds to one global clock synchronizer and is coupled with K processing units via the corresponding global clock synchronizer. The computer program is executed by the device for shared memory processing to perform acquiring a status signal of each of the K processing units when the coupled K processing units respectively transmit access requests to the corresponding shared memory unit; determining a count value of a global counter in the corresponding global clock synchronizer; determining a to-be-responded processing unit in a current clock, according to the status signal and the count value determined; and accessing the shared memory unit in the current clock, according to the to-be-responded processing unit determined. One instruction cycle of the global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are positive integers.
In order for more comprehensive understanding of features and technical solutions of implementations of the disclosure, the following will describe in detail implementations of the disclosure with reference to accompanying drawings. The accompanying drawings are merely intended for illustration rather than limitation on implementations of the disclosure.
Modem is short for modulator-demodulator. Specifically, a modem is an electronic device that can realize modulation and demodulation required for communication. At a transmitting end, a modem is configured to modulate digital signals generated by a computer serial interface into analog signals that can be transmitted on telephone cables. At a receiving end, a modem is configured to convert analog signals input to a computer into corresponding digital signals and transmit the digital signals to a computer interface. In a personal computer, a modem is often configured to exchange data and programs with other computers, and access online information service programs, etc. Here, modulation refers to conversion of digital signals into analog signals transmitted on telephone cables, and demodulation refers to conversion of analog signals back into digital signals; the two processes together give the modem its name.
Modern wireless mobile communication systems support great bandwidth and multiple carriers, and have various carrier processing capacities. Therefore, a system for signal processing is required not only to have high processing capacity, but also to flexibly and rapidly adapt to various capacity levels. To this end, in implementations of the disclosure, an efficient and flexible signal processing sub-system is provided, which is crucial to the design of a modem.
The following will describe in detail implementations of the disclosure with reference to the accompanying drawings.
Referring to
In some implementations, as illustrated in
Correspondingly, the set of global clock synchronizers 130 may include at least three global clock synchronizers. The input memory unit can be coupled with multiple processing units in the set of processing units 120 via a corresponding global clock synchronizer. The output memory unit can be coupled with multiple processing units in the set of processing units 120 via a corresponding global clock synchronizer. The one or more scratchpad memory units can also be coupled with multiple processing units in the set of processing units 120 via a corresponding global clock synchronizer.
It is to be noted that the number of the processing units coupled with each shared memory unit is related to the instruction cycle of the global clock synchronizer. Assuming that one instruction cycle includes four clocks, the number of the processing units coupled with each shared memory unit does not exceed four. In the case, for a certain shared memory unit, each of corresponding processing units accesses the shared memory unit during one of four different clocks of one instruction cycle, and thus there will be no memory access conflict.
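The time-division scheme described above can be sketched in a few lines. The following Python snippet is a hypothetical illustration only (the unit names and the `slot_of` helper are not part of the disclosure): each processing unit coupled with a shared memory unit is assigned one of the N clock slots of an instruction cycle, so at most one unit accesses the memory in any given clock.

```python
# Hypothetical sketch: time-division memory access within one instruction cycle.
# With K <= N, each of the K processing units owns one of the N clock slots,
# so no two units access the shared memory unit in the same clock.

N = 4  # clocks per instruction cycle
units = ["VDSP1", "VDSP2", "HWA1", "HWA2"]  # K = 4 coupled units, K <= N

def slot_of(unit_index, clock):
    """A unit may access the shared memory only when the cycle-relative
    clock matches its assigned slot."""
    return clock % N == unit_index

# Over one instruction cycle, exactly one unit is granted per clock.
schedule = {clock: [u for i, u in enumerate(units) if slot_of(i, clock)]
            for clock in range(N)}
print(schedule)
# → {0: ['VDSP1'], 1: ['VDSP2'], 2: ['HWA1'], 3: ['HWA2']}
```

Because every clock of the cycle grants exactly one unit, the four accesses are disjoint in time, which is the conflict-free property stated above.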
It is to be further noted that, in some implementations, as illustrated in
The external interface herein may be a network on chip (NOC), an advanced high performance bus (AHB), or a multi-core interconnect, etc., which is not limited in implementations of the disclosure.
In implementations of the disclosure, the external interface generally adopts the NOC. The NOC is a new on-chip communication method for a system on chip (SOC) and a main component of multi-core technology, and its performance is significantly superior to that of traditional bus systems.
In other words, the input memory unit and the output memory unit both are dual-port random access memories (RAMs). One port (the first input-port or the first output-port) is directly coupled with the NOC, and the other port (the second input-port or the second output-port) is coupled with specific processing units in the device 10 for shared memory processing. The NOC may carry various system data and has relatively strong randomness, such that data interaction between outside and inside of the device 10 for shared memory processing via direct memory access (DMA) may be interrupted at any time. However, with dual-port RAM design in implementations of the disclosure, data reading and data writing of the processing unit in the device 10 for shared memory processing can be isolated from interaction with external data, and thus it can be ensured that data reading and data writing of the processing unit in the device 10 for shared memory processing are not affected by interaction with external data.
In some implementations, as illustrated in
The signal processing unit herein may be a vector digital signal processor (VDSP), and the hardware accelerating unit may be a hardware accelerator (HWA). Both the signal processing unit and the hardware accelerating unit belong to data processing units. The signal processing unit and the hardware accelerating unit are both responsible for reading data from a corresponding shared memory unit, processing read data, and writing processing results to the shared memory unit.
It is to be further noted that, for the set of processing units 120, to adapt to an instruction cycle including N different clocks, a shared memory unit is coupled with specific processing units, such that it can be ensured that each shared memory unit can be accessed by N processing units at most, and the N processing units are synchronous in timing sequence, thereby realizing conflict-free access of the N processing units to the shared memory unit in N different clocks of a same instruction cycle.
Furthermore, in some implementations, as illustrated in
In this way, for the device 10 for shared memory processing, the most significant feature is that all processing units in the device 10 for shared memory processing can perform conflict-free memory access to the shared memory unit, and with the dual-port design, access to the shared memory unit is isolated from access of external data, such that the device 10 for shared memory processing has high processing efficiency and stable and predictable processing delay, and is easy to extend.
In implementations of the disclosure, the set of shared memory units 110 includes four shared memory units in a case where two scratchpad memory units are included. Correspondingly, the set of global clock synchronizers 130 also includes four global clock synchronizers.
In some implementations, the one or more scratchpad memory units may include a first vector-memory-unit and a second vector-memory-unit. Specifically, as illustrated in
The input memory unit 1101 is coupled with K1 processing units via the first global-clock-synchronizer 1301, the output memory unit 1102 is coupled with K2 processing units via the second global-clock-synchronizer 1302, the first vector-memory-unit 1103 is coupled with K3 processing units via the third global-clock-synchronizer 1303, and the second vector-memory-unit 1104 is coupled with K4 processing units via the fourth global-clock-synchronizer 1304, where K1, K2, K3, and K4 are positive integers less than or equal to N.
In implementations of the disclosure, the global clock synchronizer (grant clock synchronizer, GC-Sync) may also be regarded as an arbiter, and is configured to resolve access conflicts between multiple processing units coupled with a same shared memory unit by assigning each of the multiple processing units to perform memory access during one of different clocks, so as to achieve conflict-free memory access.
For the device 10 for shared memory processing illustrated in
It is to be further noted that, as illustrated in
In addition, since a protocol for the external interface is different from a protocol for the first input-port and a protocol for the first output-port, an interface conversion component is disposed between the external interface and the first input-port, and another interface conversion component is disposed between the external interface and the first output-port. Therefore, in some implementations, as illustrated in
In other words, in
In some implementations, the set of processing units 120 may include the at least one signal processing unit and/or the at least one hardware accelerating unit.
As illustrated in
In the case, assuming that one instruction cycle includes four clocks, the K1 processing units coupled with the input memory unit 1101 may include four processing units, i.e., the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206. The K2 processing units coupled with the output memory unit 1102 may include four processing units, i.e., the third vector-signal-processing-unit 1203, the fourth vector-signal-processing-unit 1204, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206. The K3 processing units coupled with the first vector-memory-unit 1103 may include four processing units, i.e., the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the third vector-signal-processing-unit 1203, and the fourth vector-signal-processing-unit 1204. The K4 processing units coupled with the second vector-memory-unit 1104 may include four processing units, i.e., the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206. In this case, the four global clock synchronizers are respectively illustrated as follows. The first global-clock-synchronizer 1301 is configured to achieve conflict-free memory access of the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206 to the input memory unit 1101 during one instruction cycle.
The second global-clock-synchronizer 1302 is configured to achieve conflict-free memory access of the third vector-signal-processing-unit 1203, the fourth vector-signal-processing-unit 1204, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206 to the output memory unit 1102 during one instruction cycle. The third global-clock-synchronizer 1303 is configured to achieve conflict-free memory access of the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the third vector-signal-processing-unit 1203, and the fourth vector-signal-processing-unit 1204 to the first vector-memory-unit 1103 during one instruction cycle. The fourth global-clock-synchronizer 1304 is configured to achieve conflict-free memory access of the first vector-signal-processing-unit 1201, the second vector-signal-processing-unit 1202, the first hardware-accelerating-unit 1205, and the second hardware-accelerating-unit 1206 to the second vector-memory-unit 1104 during one instruction cycle.
In other words, as illustrated in
In implementations of the disclosure, the device 10 for shared memory processing may be regarded as a vector signal processing sub-system or a vector processing cluster (VPC). The device 10 for shared memory processing may include the set of shared memory units 110 (or called a set of vector memories (VMEMs)), the set of processing units 120 (or called a set of vector digital signal processors (VDSPs) and/or a set of hardware accelerators (HWAs)), the set of global clock synchronizers 130, the TS 140, and a coupling between each specific set of processing units and a corresponding shared memory unit. The details are illustrated in
In other words, the device 10 for shared memory processing may be constituted by four VMEMs, four VDSPs, two HWAs, four global clock synchronizers, a specific coupling between each of the VMEMs and different VDSPs/HWAs, and the TS. The number of the processing units coupled with each VMEM does not exceed four to adapt to an instruction cycle having four clocks for the processing units. The input memory unit 1101 (i.e., input VMEM) can be accessed by the first vector-signal-processing-unit 1201 (VDSP1), the second vector-signal-processing-unit 1202 (VDSP2), the first hardware-accelerating-unit 1205 (HWA1), and the second hardware-accelerating-unit 1206 (HWA2). The first vector-memory-unit 1103 (i.e., scratch VMEM A) can be accessed by the first vector-signal-processing-unit 1201 (VDSP1), the second vector-signal-processing-unit 1202 (VDSP2), the third vector-signal-processing-unit 1203 (VDSP3), and the fourth vector-signal-processing-unit 1204 (VDSP4). The second vector-memory-unit 1104 (scratch VMEM B) can be accessed by the first vector-signal-processing-unit 1201 (VDSP1), the second vector-signal-processing-unit 1202 (VDSP2), the first hardware-accelerating-unit 1205 (HWA1), and the second hardware-accelerating-unit 1206 (HWA2). The output memory unit 1102 (i.e., output VMEM) can be accessed by the third vector-signal-processing-unit 1203 (VDSP3), the fourth vector-signal-processing-unit 1204 (VDSP4), the first hardware-accelerating-unit 1205 (HWA1), and the second hardware-accelerating-unit 1206 (HWA2). It is to be further noted that each of the processing units (VDSPs and/or HWAs) may further include a memory register (MR).
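The coupling topology enumerated above can be summarized compactly. The following Python fragment is a hypothetical restatement of the text (the dictionary keys mirror the VMEM labels in the description and are not reference numerals from the disclosure):

```python
# Hypothetical summary of the VMEM-to-processing-unit coupling described above.
# Each VMEM is coupled with at most four units, matching the four-clock
# instruction cycle of the VDSPs.

coupling = {
    "input_VMEM":     ["VDSP1", "VDSP2", "HWA1", "HWA2"],
    "scratch_VMEM_A": ["VDSP1", "VDSP2", "VDSP3", "VDSP4"],
    "scratch_VMEM_B": ["VDSP1", "VDSP2", "HWA1", "HWA2"],
    "output_VMEM":    ["VDSP3", "VDSP4", "HWA1", "HWA2"],
}

# Sanity check: no VMEM exceeds the four-clock budget of one instruction cycle.
assert all(len(units) <= 4 for units in coupling.values())
print(coupling["scratch_VMEM_A"])
# → ['VDSP1', 'VDSP2', 'VDSP3', 'VDSP4']
```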
The VDSP and the HWA both are data processing units, are responsible for reading data from the shared memory unit and processing read data, and then writing results into the shared memory unit. The TS is responsible for receiving a task message transmitted from outside and assigning the task message to a specific processing unit (VDSP or HWA).
The set of shared memory units 110 may include one input memory (i.e., the input memory unit 1101), one output memory (i.e., the output memory unit 1102), and some scratchpad memories (such as the first vector-memory-unit 1103 and the second vector-memory-unit 1104). The input memory and the output memory are dual-port RAMs, where one port is directly coupled with the NOC, and the other port is coupled with specific processing units in the device 10 for shared memory processing. The NOC may carry various system data and has relatively strong randomness, such that data interaction between outside and inside of the device through DMA may be interrupted at any time. However, with the dual-port RAM design, data reading and data writing of the processing unit in the device can be isolated from interaction with external data, and thus it can be ensured that data reading and data writing of the processing unit in the device are not affected by interaction with external data. Each of the VMEMs can be coupled with four processing units at most to adapt to an instruction cycle including four clocks for the VDSPs. In this way, the four processing units will not have memory access conflicts if each of the VDSPs accesses the VMEM in one of four different clocks of one instruction cycle.
It is to be further noted that, the memory is coupled with specific processors such that it can be ensured that each shared memory unit can be accessed by N processing units at most, and the N processing units are synchronous in timing sequence, thereby realizing conflict-free access of the N processing units to a specific shared memory unit in N different clocks of a same instruction cycle.
In implementations of the disclosure, the global clock synchronizer can be responsible for resolving access conflicts between multiple processing units coupled with a same shared memory unit by assigning each of the processing units to perform memory access during one of different clocks, so as to ensure the orthogonality of memory access of the processing units. In this way, the processing can be simplified when the number of the processing units coupled with the global clock synchronizer is less than or equal to the number of clocks of one instruction cycle; that is, the global clock synchronizer performs conflict resolution only when a memory access conflict occurs for the first time. After the first conflict is resolved, timing sequence synchronization can be achieved subsequently, and the processing units will not have further memory access conflicts.
In some implementations, for the device 10 for shared memory processing illustrated in
Furthermore, the global clock synchronizer is configured to select to respond to an access request from an i-th processing unit when the coupled K processing units respectively transmit access requests to a corresponding shared memory unit, in response to a status signal received from the i-th processing unit being at a high level and the count value of the global counter being equal to i.
Furthermore, the global clock synchronizer is configured to delay an instruction corresponding to the access request by one clock and keep the status signal of the i-th processing unit at the high level when the coupled K processing units respectively transmit access requests to a corresponding shared memory unit, in response to the status signal received from the i-th processing unit being at the high level and the count value of the global counter being not equal to i.
Here, i represents an index value of the i-th processing unit, and i is a positive integer less than or equal to K.
In other words, for a certain shared memory unit, the global clock synchronizer can maintain the memory-access slot assigned to each of the processing units via one global counter (grant counter), where the count value of the global counter is increased by one during each clock, and the global counter recounts from 0 when the count value reaches K-1 (K is the number of the processing units coupled with the shared memory unit). When one or more processing units need to access the shared memory unit, a corresponding status signal (which can be represented by a COREn_RD signal) will be pulled up. Upon reception of the COREn_RD signal, the global clock synchronizer selects to respond to one processing unit according to the current state of the grant counter (which can be reflected by the count value). Specifically, a to-be-responded processing unit needs to meet two conditions: (a) the COREi_RD signal transmitted by the processing unit is at a high level; (b) the current count value of the global counter is i. However, if no response is made to a processing unit that transmits a COREn_RD signal request, an internal instruction pipeline for the processing unit is delayed by one clock, and the COREn_RD signal is kept at the high level.
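As a non-limiting illustration, the grant-counter arbitration rule above can be modeled in a few lines of Python. The function name `simulate` and the `core_rd` dictionary are hypothetical conveniences; the model captures only the behavior stated in the text (grant unit i when its signal is high and the counter equals i, otherwise stall the requester one clock with its signal held high).

```python
# Hypothetical model of the grant counter (GC-Sync) described above.
# core_rd[i] is True while COREi_RD is held high (unit i requests access).

def simulate(requests, K, clocks):
    """requests: set of unit indices whose COREn_RD goes high at clock 0."""
    core_rd = {i: (i in requests) for i in range(K)}
    counter = 0
    grants = []  # (clock, index of granted unit, or None)
    for clock in range(clocks):
        # Respond to unit i iff COREi_RD is high and the count value equals i.
        granted = counter if core_rd.get(counter) else None
        if granted is not None:
            core_rd[granted] = False  # request served, signal deasserted
        # Un-served requesters simply keep core_rd high: a one-clock stall.
        grants.append((clock, granted))
        counter = 0 if counter == K - 1 else counter + 1  # recount after K-1
    return grants

# Units 0, 1, and 3 request simultaneously; each is served in its own slot,
# and slot 2 goes idle because unit 2 made no request.
print(simulate({0, 1, 3}, K=4, clocks=4))
# → [(0, 0), (1, 1), (2, None), (3, 3)]
```

Note how a simultaneous three-way conflict is resolved within a single instruction cycle without any unit being served twice, which is the orthogonality property the text describes.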
As illustrated in
As illustrated in
In combination with the working principles illustrated in
In some implementations, all units of the device 10 for shared memory processing, i.e., the set of shared memory units 110, the set of processing units 120, the set of global clock synchronizers 130, and the TS 140, etc., can be integrated in a same chip.
Briefly, in implementations of the disclosure, by means of dividing the shared memory (for example, into the input memory unit, the output memory unit, and the one or more scratchpad memory units), each shared memory unit is only coupled with and accessed by processing units whose number is adapted to the number of clocks of an instruction cycle for the processing units, which can avoid memory access conflicts between the processing units to the greatest extent. Moreover, the input/output memory unit adopts a dual-port structure, such that data processing in the device 10 for shared memory processing can be isolated from interaction with external data, thereby eliminating both external interference with internal access to the shared memory in the device and interference of internal access to the input memory unit and the output memory unit with external data. In addition, access of processing units coupled with a same shared memory unit to the shared memory unit may be made orthogonal via the global clock synchronizer.
The device for shared memory processing is provided in implementations of the disclosure. The device for shared memory processing includes the set of shared memory units, the set of processing units, and the set of global clock synchronizers. Each shared memory unit corresponds to one global clock synchronizer and is coupled with K processing units via the corresponding global clock synchronizer. The coupled K processing units perform conflict-free memory access to the shared memory unit during one instruction cycle. One instruction cycle of the global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are integers greater than zero. In this way, on one hand, in the device for shared memory processing, multiple processing units can perform conflict-free memory access to a same shared memory unit, such that the device for shared memory processing is easy to extend, and therefore modems that have different processing capacity levels can be achieved by means of increasing the number of devices for shared memory processing; on the other hand, the access to the shared memory unit in the device for shared memory processing can be isolated from access of external data, thereby eliminating both external interference with internal access to the shared memory unit in the device for shared memory processing and interference of the input/output memory unit with external data. In addition, since the device for shared memory processing realizes efficient and conflict-free memory access, the processing delay is stable and can be predicted, and the processing efficiency is improved.
Referring to
Referring to
It is to be noted that, the device 10 for shared memory processing can be regarded as a vector signal processing sub-system, or called VPC. Multiple devices for shared memory processing can constitute the system 40 for signal processing. Moreover, the system 40 for signal processing not only has high processing capacity, but also can flexibly and rapidly make changes according to different capacity levels.
It is to be further noted that, for the device 10 for shared memory processing, the most significant feature is that all processing units in the device can access the shared memory unit without conflict, and with dual-port design, internal access to the shared memory unit in the device can be isolated from access of external data, such that the device has high processing efficiency and stable and predictable processing delay, and is easy to extend. In this way, modems that have different processing capacities can be realized rapidly by means of connecting different numbers of the devices 10 for shared memory processing with the NOC of the modem 50.
In implementations of the disclosure, the processing units in the device can perform conflict-free access, which is not affected by external NOC data flow, and also does not affect data transmission of the NOC. Therefore, modems that have different capacity levels can be realized stably and rapidly by simply increasing the number of the devices, and thus rapid customization of the modem 50 having different capacities can be realized. Moreover, in the device 10 for shared memory processing, each of the processors in the device can perform conflict-free access to the shared memory via division of the shared memory, the dual-port input/output RAM, the coupling between specific processors and the memory, and the global clock synchronizer, etc. Additionally, because of conflict-free memory access, processing timing for the device may be computable and predictable, and the device is stable and extensible, thereby achieving efficient and conflict-free memory access, which is of great significance for rapid design of an efficient and stable modem.
Referring to
At S601, acquire a status signal of each of coupled K processing units when the coupled K processing units respectively transmit access requests to a corresponding shared memory unit.
At S602, determine a count value of a global counter of a global clock synchronizer.
At S603, determine a to-be-responded processing unit in a current clock, according to the status signal and the count value determined.
At S604, access the shared memory unit in the current clock, according to the to-be-responded processing unit determined.
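The four operations S601 to S604 can be sketched as a single arbitration step. The following Python function is a hypothetical simplification (the names `arbitration_step`, `status_signals`, and `requests` are illustrative, and the memory is modeled as a plain dictionary); it follows the rule stated later in the text that unit i is responded to when its status signal is high and the count value equals i.

```python
# Hypothetical sketch of S601-S604 as one arbitration step per clock.

def arbitration_step(status_signals, count_value, memory, requests):
    """status_signals: per-unit booleans acquired at S601.
    count_value: the global counter value determined at S602.
    Returns the index of the unit served, or None if no unit matches."""
    # S603: unit i is the to-be-responded unit iff its status signal is
    # high and the current count value equals i.
    if status_signals[count_value]:
        unit = count_value
        addr, data = requests[unit]  # S604: perform this unit's access
        memory[addr] = data          # modeled here as a write
        return unit
    return None  # no unit owns the current slot; pending units stall

mem = {}
served = arbitration_step([True, False, True, False], count_value=2,
                          memory=mem, requests={2: (0x10, "payload")})
print(served, mem)
# → 2 {16: 'payload'}
```

Running this step once per clock, with the count value incremented modulo K between clocks, reproduces the conflict-free schedule of the device implementations above.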
It is to be noted that, the method for shared memory processing is applicable to the device 10 for shared memory processing of any one of implementations mentioned above. The device 10 for shared memory processing may include the set of shared memory units, the set of processing units, and the set of global clock synchronizers. Each shared memory unit corresponds to one global clock synchronizer and is coupled with K processing units via the corresponding global clock synchronizer. The coupled K processing units can perform conflict-free memory access to the shared memory unit during one instruction cycle. The one instruction cycle of the global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are integers greater than zero.
It is to be further noted that the number of the processing units coupled with each shared memory unit is related to the instruction cycle of the global clock synchronizer. Assuming that one instruction cycle includes four clocks, the number of the processing units coupled with each shared memory unit does not exceed four. In the case, for a certain shared memory unit, each of corresponding processing units accesses the shared memory unit during one of four different clocks of one instruction cycle, and thus there will be no memory access conflict.
In some implementations, the set of shared memory units may include at least three shared memory units, and the at least three shared memory units may include the input memory unit, the output memory unit, and the one or more scratchpad memory units.
The input memory unit adopts a dual-port structure, and the output memory unit adopts a dual-port structure, such that data reading and data writing of the processing unit in the device 10 for shared memory processing can be isolated from interaction with external data, and thus it can be ensured that data reading and data writing of the processing unit in the device 10 for shared memory processing are not affected by interaction with external data.
In some implementations, the set of processing units may include at least one signal processing unit and/or at least one hardware accelerating unit.
Both the signal processing unit and the hardware accelerating unit belong to data processing units. The signal processing unit and the hardware accelerating unit are both responsible for reading data from a corresponding shared memory unit, processing read data, and then writing processing results to the shared memory unit.
It is to be further noted that, for the set of processing units, to adapt to an instruction cycle including N different clocks, a shared memory unit is coupled with specific processing units, such that it can be ensured that each shared memory unit can be accessed by N processing units at most, and the N processing units are synchronous in timing sequence, thereby realizing conflict-free access of the N processing units to the shared memory unit in N different clocks of a same instruction cycle.
Furthermore, the device for shared memory processing may further include a TS. The TS is coupled with an external interface and the set of processing units. Therefore, in some implementations, the method further includes receiving a task message transmitted through the external interface, forwarding the task message to a processing unit for execution in the set of processing units via the TS, and performing a task corresponding to the task message via the processing unit for execution.
It is to be noted that the processing unit for execution is a specific processing unit configured to execute the task corresponding to the task message in the set of processing units. The processing unit for execution herein may be the signal processing unit or the hardware accelerating unit, which is not limited in implementations of the disclosure.
It is to be further noted that the global clock synchronizer is responsible for resolving access conflicts among multiple processing units coupled with a same shared memory unit, by assigning each of the processing units one of the different clocks in which to perform memory access, so as to ensure the orthogonality of memory access of the processing units. When the number of processing units coupled with the global clock synchronizer is less than or equal to the number of clocks in the instruction cycle, the processing is simplified: the global clock synchronizer performs conflict resolution only when a memory access conflict occurs for the first time. After this first conflict is resolved, timing-sequence synchronization is maintained, and the processing units have no further memory access conflicts.
In some implementations, each global clock synchronizer may include a global counter. The global counter is configured to control the memory-access slot assigned to each of the coupled K processing units, and its count value is increased by one during each clock. When the count value reaches K-1, the count value is cleared and the global counter recounts.
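The behavior of the global counter can be sketched as a modulo-K counter that advances once per clock (a Python model for illustration only; the class and method names are assumptions, not part of the disclosure):

```python
class GlobalCounter:
    """Models the global counter of a global clock synchronizer: the count
    value is increased by one during each clock and, after reaching K - 1,
    is cleared so that the counter recounts. The count value thus cycles
    through the K memory-access slots 0, 1, ..., K - 1."""

    def __init__(self, k):
        self.k = k          # number of processing units coupled with the shared memory unit
        self.count = 0      # current count value (the active memory-access slot)

    def tick(self):
        """Advance one clock; wrap to 0 after reaching K - 1."""
        self.count = (self.count + 1) % self.k
        return self.count

counter = GlobalCounter(k=4)
slots = [counter.count] + [counter.tick() for _ in range(7)]
# With K = 4, eight consecutive clocks cycle through slots 0..3 twice:
# slots == [0, 1, 2, 3, 0, 1, 2, 3]
```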
Furthermore, in some implementations, for S603, determining the to-be-responded processing unit in the current clock according to the status signal and the count value determined may include determining an i-th processing unit as the to-be-responded processing unit in the current clock, in response to a status signal of the i-th processing unit being at a high level and the count value determined being equal to i, where i represents an index value of the i-th processing unit, and i is a positive integer less than or equal to K.
Furthermore, in some implementations, the method may include keeping the status signal of the i-th processing unit at the high level and delaying an instruction corresponding to the access request by one clock, in response to the status signal of the i-th processing unit being at the high level and the count value determined being not equal to i; and determining the i-th processing unit as the to-be-responded processing unit in the current clock in response to the count value determined being equal to i, after delaying the instruction by one clock.
It is to be noted that the count value is increased by one when the instruction is delayed by one clock. When the count value reaches K-1, the count value of the global counter is cleared and the global counter recounts. In this way, after delaying the instruction by one clock, it can be re-determined whether the count value is i and whether the status signal of the i-th processing unit is at the high level. If the count value is not i and/or the status signal of the i-th processing unit is not at the high level, delaying the instruction by one clock is re-executed. If the count value is i and the status signal of the i-th processing unit is at the high level, the i-th processing unit can be determined as the to-be-responded processing unit in the current clock, and the shared memory unit is then accessed in the current clock according to the to-be-responded processing unit determined.
In other words, for a certain shared memory unit, the global clock synchronizer can maintain the memory-access slot assigned to each of the processing units via one global counter, where the count value of the global counter increases by one in each clock, and the global counter recounts from 0 when the count value reaches K-1 (K being the number of processing units coupled with the shared memory unit). When one or more processing units need to access the shared memory unit, the corresponding status signal (which can be represented by a COREn_RD signal) is pulled up. Upon reception of the COREn_RD signal, the global clock synchronizer selects one processing unit to respond to, according to the current state of the global counter (which is reflected by the count value). Specifically, a to-be-responded processing unit needs to meet two conditions: (a) the COREi_RD signal transmitted by the processing unit is at a high level; (b) the count value of the global counter is i. If no response is made to a processing unit that has transmitted a COREn_RD signal request, the internal instruction pipeline of that processing unit is delayed by one clock, and the COREn_RD signal is kept at the high level.
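The grant-or-stall behavior described above can be sketched as follows (a Python simulation for illustration only; `arbitrate` and the surrounding loop are assumed names modeling the hardware logic, not an implementation from the disclosure):

```python
def arbitrate(rd_signals, count):
    """Select the to-be-responded processing unit for the current clock.

    rd_signals: list of K booleans; rd_signals[i] is True when the
    COREi_RD status signal of processing unit i is at a high level.
    count: current count value of the global counter (0 .. K-1).

    Returns the index i of the granted unit, or None when no unit is
    granted in this clock (pending requesters are then stalled by one
    clock and keep their status signals at the high level).
    """
    if rd_signals[count]:
        return count       # both conditions met: COREi_RD high and count value == i
    return None            # no response this clock; pipelines delayed by one clock

# Example: K = 3 processing units; units 0 and 2 request access simultaneously.
pending = {0, 2}
granted_order = []
clock = 0
count = 0
while pending:
    rd = [i in pending for i in range(3)]   # stalled units keep COREn_RD high
    winner = arbitrate(rd, count)
    if winner is not None:
        granted_order.append((clock, winner))
        pending.discard(winner)
    clock += 1
    count = (count + 1) % 3                 # global counter advances each clock
# Unit 0 is served in clock 0 (count value 0); unit 2 stalls until clock 2
# (count value 2): granted_order == [(0, 0), (2, 2)]
```

This illustrates why access remains conflict-free: at most one unit can satisfy both conditions in any clock, so the simultaneous requests are spread over distinct clocks of the same instruction cycle.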
In combination with the working principles illustrated in
The method for shared memory processing is provided in the implementation and applicable to the device for shared memory processing. The method includes acquiring a status signal of each of K processing units when the coupled K processing units respectively transmit access requests to a corresponding shared memory unit; determining a count value of a global counter of a global clock synchronizer; determining the to-be-responded processing unit in the current clock, according to the status signal and the count value determined; and accessing the shared memory unit in the current clock, according to the to-be-responded processing unit determined. One instruction cycle of the global clock synchronizer includes N clocks, K is less than or equal to N, and K and N are integers greater than zero. In this way, multiple processing units can perform conflict-free memory access to a same shared memory unit in the device for shared memory processing, such that the device for shared memory processing is easy to extend, and therefore modems that have different processing capacity levels can be achieved by increasing the number of the devices for shared memory processing. Moreover, internal access to the shared memory unit in the device for shared memory processing can be isolated from access of external data, thereby eliminating the interference to the internal access to the shared memory unit in the device for shared memory processing and the interference of the input/output memory unit to the external data. In addition, since the device for shared memory processing realizes efficient and conflict-free memory access, the processing delay may be stable and predictable, and the processing efficiency is improved.
It could be understood that the device 10 for shared memory processing provided in implementations of the disclosure may be an integrated circuit chip with signal processing capacity. During implementation, each step of the foregoing implementations of the method may be completed by combining an integrated logic circuit in the form of hardware in the device 10 for shared memory processing with an instruction in the form of software. Based on such understanding, some functions of the technical solution of the disclosure can be embodied in the form of software products. Therefore, a computer storage medium storing a computer program is provided in the implementation. Steps of the method for shared memory processing of the foregoing implementations are performed when the computer program is executed by the device for shared memory processing.
Those of ordinary skill in the art will appreciate that units and algorithmic operations of various examples described in connection with implementations of the disclosure can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by means of hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods with regard to each particular application to implement the described functionality, but such implementation should not be regarded as lying beyond the scope of the disclosure.
It will be evident to those skilled in the art that, for the sake of convenience and simplicity, in terms of the working processes of the foregoing systems, devices, and units, reference can be made to the corresponding processes of the above method implementations, which will not be repeated herein.
It is to be noted that, in the disclosure, the terms "comprise" and "contain" as well as variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements not only includes those elements, but also includes other elements not explicitly listed, or also includes elements inherent to such process, method, article, or device. Without further restrictions, an element defined by the statement "including one . . . " does not exclude the existence of another identical element in the process, method, article, or device including the element.
The serial numbers of implementations of the disclosure are only for illustration and do not represent the advantages and disadvantages of implementations.
The methods disclosed in several method implementations provided in the disclosure can be combined arbitrarily to obtain new method implementations without conflict.
The features disclosed in several product implementations provided in the disclosure can be combined arbitrarily to obtain new product implementations without conflict.
The features disclosed in several method or device implementations provided in the disclosure can be combined arbitrarily to obtain new method implementations or device implementations without conflict.
The above is only the specific implementations of the disclosure, but the protection scope of the disclosure is not limited thereto. Anyone skilled in the technical field can easily think of changes or replacements within the technical scope of the disclosure, and such changes or replacements should fall within the protection scope of the disclosure. Therefore, the protection scope of the disclosure shall be subject to the protection scope of the claims.
In implementations of the disclosure, multiple processing units can perform conflict-free memory access to a same shared memory unit in the device for shared memory processing, such that the device for shared memory processing is easy to extend, and therefore modems that have different processing capacity levels can be achieved by increasing the number of the devices for shared memory processing. Moreover, internal access to the shared memory unit in the device for shared memory processing can be isolated from access of external data, thereby eliminating the interference to the internal access to the shared memory unit in the device for shared memory processing and the interference of the input/output memory unit to the external data. In addition, since the device for shared memory processing realizes efficient and conflict-free memory access, the processing delay may be stable and predictable, and the processing efficiency is improved.
This application is a continuation of International Application No. PCT/CN2020/106648, filed Aug. 3, 2020, the entire disclosure of which is incorporated herein by reference.
 | Number | Date | Country
---|---|---|---
Parent | PCT/CN2020/106648 | Aug 2020 | US
Child | 18063298 | | US