The present invention relates to efficient utilization of a multi-core processing system and more specifically to an apparatus and method directed to process synchronization of embedded applications in multi-core processing systems while maintaining memory coherency.
The shift toward multi-core processor chips poses challenges to synchronizing the operations running on each core in order to fully utilize the enhanced performance opportunities presented by multi-core processors (e.g., running different applications on different processor cores at the same time and running different operations of the same application on different processor cores). However, present methods of synchronizing operations, such as locks and semaphores, require atomic instructions (e.g., test-and-set, swap, etc.) or interrupt disabling, are difficult to implement, and can lead to race conditions, deadlocks and inefficient use of the processors. Accordingly, there exists a need in the art to mitigate the deficiencies and limitations described hereinabove.
A first aspect of the present invention is a system for process synchronization in a multi-core computer system, comprising: a primary processor core to control scheduling, completion and synchronization of a plurality of processing threads for the system-on-chip (SOC), the primary processor core having a dedicated memory region to facilitate control of processes; a plurality of secondary processor cores each coupled to the primary processor core via address and control line bus architecture, the plurality of secondary processor cores responsive to command inputs from the primary processor core to execute instructions and each having dedicated memory to facilitate control of processes; a first memory wherein the primary processor core and each secondary processor core of the plurality of secondary processor cores have read access to all addresses of said first memory, and wherein write access to the first memory by the primary processor core and each secondary processor core of the plurality of secondary processor cores is restricted to respective address regions; and a switch matrix enabling inter-core communication between the primary processor core and any secondary processor core of the plurality of secondary processor cores and between any pair of secondary processor cores of the plurality of secondary processor cores, according to a pre-defined transmission protocol.
A second aspect of the present invention is a method for process synchronization in a multi-core computer system, comprising: providing a first memory having a dedicated domain for each processor core of a plurality of processor cores, each of the dedicated domains readable by any of the plurality of processor cores; providing a second memory having a dedicated domain for each processor core of the plurality of processor cores; writing a value to an address allocated to a first processor core of the plurality of processor cores in the first memory such that a busy or idle state of the first core may be read by each of the remaining plurality of processor cores; maintaining a value matrix in the second memory for each of the plurality of processor cores enabling a corresponding processor core to monitor the busy and idle states of each of the other processor cores; applying an exclusive ‘OR’ to the value matrix entry for each one of the plurality of processor cores when a busy or idle state of the corresponding one of the plurality of processor cores changes; and writing the result of the exclusive ‘OR’ operation to a corresponding domain of the first memory to update the status of the corresponding one of the plurality of processor cores.
These and other aspects of the invention are described below.
The features of the invention are set forth in the appended claims. The invention itself, however, will be best understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a first memory having dedicated write address space for each processor core of a multi-core processor and common read access to all address space by all processor cores. The present invention also provides a multiplicity of processor-dedicated second memories that are linked to the first memory. The first and second memories provide a mechanism for indicating synchronization information, such as processor status (e.g., busy, idle, error), an event occurrence, or a pending instruction, and between which of the multiple processor cores the synchronization information is to be communicated.
In one example, processor core 120A is a primary processor core and processor cores 120B, 120C and 120D are secondary processor cores. A primary processor core controls scheduling, completion and synchronization of processing threads on all processor cores to ensure each process has reached a required state before further processing can occur. Secondary processor cores are responsive to command outputs from the primary processor core to execute instructions. Secondary processor cores can also synchronize with each other. Synchronization can be implemented as synchronization points where all secondary processor cores wait for a signal from the primary processor core. On reaching the synchronization point, the primary processor core sets the signal to all secondary processor cores and waits for acknowledgement from all the secondary processor cores. On receiving the acknowledgement from the secondary processor cores, the primary processor core instructs the secondary processor cores to proceed (e.g., to the next synchronization point).
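The signal/acknowledge exchange above can be illustrated with a minimal sketch in 'C'. This is a sequential simulation only, assuming a hypothetical layout in which each flag word is written by exactly one core (the single-writer discipline of the dedicated write regions); the names `go_flag`, `ack_flag` and the three routines are illustrative, not part of the invention as claimed.

```c
#include <assert.h>

#define NUM_SECONDARY 3

/* Hypothetical single-writer flag words; in the real system each flag
   would live in the writer's dedicated region of the shared memory. */
static int go_flag[NUM_SECONDARY];   /* each entry written only by the primary core */
static int ack_flag[NUM_SECONDARY];  /* each entry written by one secondary core    */

/* Primary core: raise the "proceed" signal for every secondary core. */
static void primary_set_signal(void) {
    for (int i = 0; i < NUM_SECONDARY; i++)
        go_flag[i] = 1;
}

/* Secondary core i: on seeing the signal, acknowledge it. */
static void secondary_acknowledge(int i) {
    if (go_flag[i])
        ack_flag[i] = 1;
}

/* Primary core: true once every secondary has acknowledged, i.e., the
   synchronization point has been reached by all cores. */
static int primary_all_acknowledged(void) {
    for (int i = 0; i < NUM_SECONDARY; i++)
        if (!ack_flag[i])
            return 0;
    return 1;
}
```

In the real system the primary core would busy-wait on `primary_all_acknowledged()` before issuing the next round of instructions; the sketch runs the three roles in one thread to keep the state machine visible.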
In one example, processor cores 120A, 120B, 120C and 120D are multi-threaded processors. A multi-threaded processor runs more than one task's instruction stream (thread) at a time. To do so, the processor core has more than one program counter and more than one set of programmable registers. The embodiments of the present invention are applicable to single-thread processors and can be extended to multi-threaded processors by treating each thread as a core.
It should be understood that dedicated memories 130A, 130B, 130C and 130D and OCM memory need not be physically different memory cores but may, in one example, be partitions of the same memory core. In another example, dedicated memories 130A, 130B, 130C and 130D are partitions of a first memory core and OCM is a second memory core.
Each processor core 120A through 120D can write to only one dedicated (and different) row of OCM 125, while all processor cores 120A through 120D can read all rows of OCM 125. Alternatively, throughout the description of the invention, “column” may be substituted for all instances of “row” and “row” substituted for all instances of “column.” The lines labeled R and W are implemented as a switch matrix enabling processor core to processor core communication. As described infra, the source of information written to OCM 125 is from write domains 135A through 135D (see
In the example of
In the more general case of m processor cores having respective m dedicated write domains (where i=0 to m−1 and j=0 to m−1), when processor core i wants to send a synchronization signal to processor core j it uses the (i,j)th location of the ith write domain and the (i,j)th location of the OCM to do so. After sending the synchronization signal to the OCM, processor core i changes the value (toggles between 0 and 1 if n=1) in the (i,j)th location of write domain (i). Similarly, processor core j waits for the (i,j)th location of the OCM to change to a value different from that currently in the (i,j)th location of write domain (j). When the value changes, this new value is written to the (i,j)th location of write domain (j), overwriting the old value.
When n=1, the synchronization is a two state machine and the synchronization signal is reduced to changing the state of the (i,j)th locations. A powerful use of the present invention in a two state mode (i.e., busy and idle) is the ability of the primary core to know when a secondary processor is idle and then issue instructions for the idle secondary processor to initiate another process. In such a two state system, the primary processor core can direct the timing of the execution of processes on the secondary processor cores by waiting until all secondary processor cores are idle, to ensure processes that must be completed before other processes can start have been completed. In other words, to automatically and quickly detect that a process-synchronization point has been obtained. The secondary processor cores can then be assigned further processes by instructions sent by the primary processor core by normal command routes. When n is greater than 1, the synchronization is a 2^n state machine. Toggling may be accomplished using an exclusive “OR.” The system is initialized by writing the same value to all (i,j)th locations of all write domains of all dedicated memories and to all (i,j)th locations of the OCM.
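The (i,j)th-location protocol for n=1 can be sketched as follows. This is a sequential simulation under stated assumptions: `ocm` stands in for the shared OCM matrix (each row i writable only by core i), `dom[k]` stands in for core k's dedicated write domain, and the routine names are hypothetical. Initialization to a common value is provided by C's zero-initialized statics, matching the initialization step described above.

```c
#include <assert.h>

#define M 4  /* number of processor cores (illustrative) */

/* Shared first memory: ocm[i][j] carries the signal from core i to core j.
   Only row i is writable by core i; every core may read every location. */
static int ocm[M][M];

/* Second memories: dom[k] is core k's dedicated write domain. The sender
   uses dom[i][i][j]; the receiver uses dom[j][i][j]. */
static int dom[M][M][M];

/* Core i signals core j: toggle the (i,j)th value via exclusive OR (n = 1),
   publish it to the OCM, and record it in core i's own write domain. */
static void send_signal(int i, int j) {
    int v = dom[i][i][j] ^ 1;
    ocm[i][j] = v;
    dom[i][i][j] = v;
}

/* Core j polls for a signal from core i: a signal is pending when the OCM
   entry differs from core j's last-seen value; consuming it overwrites the
   old value in core j's write domain, as described above. */
static int poll_signal(int i, int j) {
    if (ocm[i][j] != dom[j][i][j]) {
        dom[j][i][j] = ocm[i][j];
        return 1;
    }
    return 0;
}
```

Because each OCM row and each write domain has exactly one writer, no atomic instructions are needed; in the real system `poll_signal` would be the body of a busy-wait loop.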
In a general single processor core system, maintaining coherency is the responsibility of the operating system, and the application developer need not worry about it. However, in a multi-processor core system, the developer has to take care of these issues. These issues were studied using a system simulator model for an eight core system-on-chip with 1 MB on-chip non-caching shared memory. Open-source GNU (GNU's Not Unix) tools for developing embedded PowerPC applications were used for software development. The system was programmed in the ‘C’ programming language, with embedded assembler code for cache-related operations.
The model included: (1) Processors are numbered from 0 to (m−1), where m is the number of processors. (2) Processor 0 is the primary processor and the other processors are secondary processors. The primary processor performs I/O operations. (3) Programs which are expected to be executed by various processors are loaded in specific ranges of memory as configured in the scripts for the memory loader. (4) Since programs are loaded in specific ranges, the processor identification number was obtained by a small routine GetMyid( ). (5) The synchronization signal scheme described in relation to
Various routines used are listed and include:
int GetMyid(void)—used by processors to get their process identification (ID) number;
void setsignal(int id)—the processor sets the signal using its processor ID number;
void waitsignal(int id)—a processor waits for a signal from a processor with its processor ID number;
void sync(void)—synchronization mechanism; while processor ID 0 sets the signal, all other processors wait for a signal from processor ID 0. On receiving the signal from processor ID 0, a processor other than processor ID 0 sets a signal to processor ID 0, and processor ID 0 waits for signals from all other processors;
void signaltoproc(int toid)—used by a processor to set a signal for a particular processor;
void waitforproc(int fromid)—used by a processor to wait for a signal from a particular processor;
void checksignal(int fromid)—used by a processor to check whether a signal is ready from processor fromid; the value location is not modified, so a subsequent waitforproc(fromid) is needed to consume the signal;
void clearsignals(void)—used by the primary processor to clear the signal locations, before ending the execution. The routine can also be used by a serial program to clear the signal memory before running the real parallel application;
void storeCache(unsigned long addr)—store the cache line which holds the memory address addr;
void invalidateCache(unsigned long addr)—invalidate the cache line which holds the memory address addr; and
void flushCache(unsigned long addr)—flush the cache line which holds the memory address addr.
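The signaling routines listed above can be sketched as a sequential simulation over a shared signal vector, assuming the exclusive-OR toggling scheme described earlier. The extra `me` parameter, the `seen` array, and the name `sync_all` (to avoid clashing with a standard library `sync`) are illustrative assumptions; in the real routines each processor's identity is implicit and the waits are busy-wait loops.

```c
#include <assert.h>

#define NPROCS 8

/* Hypothetical non-cached shared signal vector: sig[id] is toggled only
   by processor id and read by all (single-writer discipline). */
static int sig[NPROCS];
static int seen[NPROCS][NPROCS];  /* seen[me][id]: last value `me` observed */

/* Processor `me` raises its signal by toggling its own vector entry. */
static void setsignal(int me) { sig[me] ^= 1; }

/* Non-blocking check: has processor `from` toggled since `me` last looked? */
static int checksignal(int me, int from) { return sig[from] != seen[me][from]; }

/* Consume the signal (the real routine busy-waits until it is pending). */
static void waitsignal(int me, int from) { seen[me][from] = sig[from]; }

/* sync(): processor 0 signals everyone, every other processor answers,
   and processor 0 collects all the answers -- here run sequentially. */
static void sync_all(void) {
    setsignal(0);
    for (int p = 1; p < NPROCS; p++) {
        waitsignal(p, 0);   /* processor p sees processor 0's signal */
        setsignal(p);       /* processor p signals back              */
    }
    for (int p = 1; p < NPROCS; p++)
        waitsignal(0, p);   /* processor 0 collects the answers      */
}
```

After `sync_all` returns, every processor has observed every other relevant signal, so no processor can run ahead of the synchronization point.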
On-chip memory was partitioned into several sections. The signal vector and matrix were stored in a non-cached on-chip shared memory section starting at address 0xc0000000. (This is memory 125 of
The programming sequence was: (1) Each processor received its processor ID number. (2) Processor ID 0 initialized the input section stored in OCM and, in a separate loop, the memory locations were stored to cache memory, so that the OCM was synchronized with cache. Storing was done in a separate loop to avoid storing of already stored cache lines. Then processor ID 0 set synchronization signals for all other processors. No explicit cache operations are needed for other processors since the other processors had not yet used any values from OCM. (3) All processors computed their share of computation while avoiding frequent reference to a write-through memory. Hence summing was done on a local variable and finally the results were stored in the output section of OCM. (4) Processor ID 0 invalidated the cache value of the output section of OCM, so that further computation loaded the correct value from the OCM.
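Steps (3) and (4) of the sequence can be sketched as a sequential simulation. The layout names (`ocm_input`, `ocm_output`) and the even partitioning of the work are assumptions for illustration; the cache operations, which a simulation cannot exercise, are indicated in comments where the real code would invoke them.

```c
#include <assert.h>

#define NPROCS 8
#define N 64   /* illustrative problem size, a multiple of NPROCS */

/* Hypothetical OCM layout: an input section and one output slot per core. */
static int  ocm_input[N];
static long ocm_output[NPROCS];

/* Step (3): each processor sums its contiguous share into a local variable
   (avoiding frequent writes to write-through memory) and stores only the
   final result in its OCM output slot. */
static void compute_share(int id) {
    int  chunk = N / NPROCS;
    long local = 0;                      /* local accumulator */
    for (int k = id * chunk; k < (id + 1) * chunk; k++)
        local += ocm_input[k];
    ocm_output[id] = local;              /* single store to shared memory */
    /* real code: storeCache((unsigned long)&ocm_output[id]); */
}

/* Step (4): processor ID 0 combines the partial sums. The real code first
   invalidates the cached copies of the output section, so that the correct
   values are loaded from the OCM rather than from stale cache lines. */
static long combine(void) {
    /* real code: invalidateCache() over the output section */
    long total = 0;
    for (int p = 0; p < NPROCS; p++)
        total += ocm_output[p];
    return total;
}
```

Accumulating locally and publishing once is the key coherency-friendly choice here: each OCM output slot has a single writer and receives a single store per computation.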
An unexpectedly high efficiency of about 95% was obtained for the eight processor core system using the architecture of the present invention. The speed-up of the eight processor core system using the present invention was about 7.5. Speed-up is defined as the ratio of the execution time of a system with one processor core to the execution time of a system with m processor cores. Efficiency is 100 times (Speed-up/m).
Because the shared memory for the first memory is not on the same chip as the processor cores, there is a performance penalty due to the overhead associated with system bus 225.
Computer system 200 also includes arbiter 245 for arbitrating traffic on system bus 225, a bridge 250 between system bus 225 and a peripheral bus 255, an arbiter 260 for arbitrating traffic on peripheral bus 255, and peripheral cores 265A, 265B, 265C and 265D.
The description of the embodiments of the present invention is given above for the understanding of the present invention. It will be understood that the invention is not limited to the particular embodiments described herein, but is capable of various modifications, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, it is intended that the following claims cover all such modifications and changes as fall within the true spirit and scope of the invention.