The field of invention pertains to the computing sciences and, more specifically, to a processor with transactional capability and logging circuitry to report transactional operations.
Multi-core processors and/or multi-threaded instruction execution pipelines within processing cores have caused software programmers to develop multi-threaded software programs (as opposed to single threaded software programs). Multi-threaded software is naturally complex because of the different processes that concurrently execute. However, multi-threaded software is additionally difficult to debug because of an aspect of “non-determinism” in the manner of its execution. Specifically, a multi-threaded software program may execute differently across two different run-times even if the program starts from an identical input state.
For these reasons “logging” is used to record certain critical junctures in a multi-threaded software program's execution. Processors are presently designed with logging circuits that observe the execution of a processor's software and record certain critical events that the circuits have been designed to detect. If the software program crashes, the log record is analyzed to study the execution of the program leading up to the crash.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Each instance of logging circuitry is assigned a specific region of system memory 103 in which to store its respective chunks. Each hardware thread executed by a particular core is allocated its own respective space within the system memory region allocated to the logging circuitry. Here, as is known in the art, a single instruction execution pipeline can concurrently execute multiple hardware threads (e.g., 8 hardware threads). Moreover, each processing core can contain more than one instruction execution pipeline (e.g.,
Hardware threads are understood to be the threads actively being executed within an instruction execution pipeline. Instruction execution pipelines are typically designed to concurrently execute a maximum/limited number of hardware threads where the maximum/limit is set by the hardware design of the pipeline. A software thread is understood to be a singular stream of program code instructions. The number of software threads supported by a processor can greatly exceed the number of hardware threads. A software thread is recognized as also being a hardware thread when the thread's state/context information is switched into an instruction execution pipeline. The software thread loses its hardware thread status when its state/context is switched out of the instruction execution pipeline. In one embodiment, there is one instance of logging circuitry per hardware thread (for simplicity
In an implementation, a logging circuitry instance (e.g., instance 101_1) is designed to terminate a chunk for a thread on any of the following conditions: 1) a memory race condition; 2) a switch of the thread from an active to a hibernated state; 3) a translation look-aside-buffer (TLB) invalidation; 4) a transition of the thread outside a privilege level it was configured for (e.g., the thread transitions from a “user” privilege level to a “kernel” privilege level in response to an interrupt or exception); 5) an attempt by the thread to access an un-cacheable memory region. Here, each of the above-described events contributes to the non-deterministic manner in which multi-threaded programs execute.
Here, each logging circuitry instance 101_1 through 101_N is coupled to “hooks” 104_1 through 104_N in their respective processing cores 105_1 through 105_N of the processor (e.g., in the vicinity of the instruction execution pipelines 106_1 through 106_N that execute the respective instruction streams of the various software threads) that are designed to detect the looked for chunk termination events. During execution of a particular thread, the various hooks detect a chunk termination event for the thread and report the event to the logging circuitry 101. In response, the logging circuitry 101 formulates a packet consistent with the structure of inset 120 and causes the packet to be written to external memory 103.
One of these hooks within each core is coupled to a memory race detection circuit 107_1 through 107_N. As observed in
A memory race occurs when two different software processes (e.g., two different threads) try to access the same memory location. Here, each thread remembers all memory accesses (addresses) of the current chunk. A chunk is terminated and a new chunk is created when a conflict with one of the addresses remembered by the current chunk is detected (no matter how far in the past that access occurred).
Notably, a race can be caused when two different threads on the same core attempt to access the same memory location or when two different threads on two different cores attempt to access the same memory location. In the latter case, a first core will snoop a second core's L1 cache. Here, interconnection network 109 is used to transport such snoops.
Each memory race detection circuit 107_1 through 107_N tracks recent read operations and recent write operations (referred to as “read sets” and “write sets”) and compares them to incoming read requests and incoming write requests. A memory race circuit will detect a memory race condition anytime it detects concurrent “read-after-write” (RAW), “write-after-write” (WAW) or “write-after-read” (WAR) operations directed to the same memory address. In various embodiments, the identity of the conflicting address may optionally be included in the chunk (depending on whether larger or smaller chunks are desired) that is recorded for a memory race.
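The read-set/write-set comparison described above can be illustrated with a short software model. This is only a sketch under stated assumptions: the class name `RaceDetector` and its methods are hypothetical, exact Python sets stand in for whatever storage the circuit actually uses, and an incoming request is assumed to come from a different thread than the one owning the sets.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RaceDetector:
    """Toy model of one memory race detection circuit (hypothetical names)."""
    read_set: Set[int] = field(default_factory=set)   # addresses read in the current chunk
    write_set: Set[int] = field(default_factory=set)  # addresses written in the current chunk

    def record_local(self, addr: int, is_write: bool) -> None:
        """Remember an access made by the thread that owns this detector."""
        (self.write_set if is_write else self.read_set).add(addr)

    def check_incoming(self, addr: int, is_write: bool) -> Optional[str]:
        """Classify an incoming request from another thread, or return None."""
        if addr in self.write_set:
            # The incoming access follows a local write to the same address.
            return "WAW" if is_write else "RAW"
        if is_write and addr in self.read_set:
            # An incoming write follows a local read of the same address.
            return "WAR"
        return None  # read-after-read is not treated as a race

    def terminate_chunk(self) -> None:
        """Begin a new chunk: forget the addresses remembered so far."""
        self.read_set.clear()
        self.write_set.clear()
```

In this model, a non-None result from `check_incoming` corresponds to a memory race condition, after which the caller would report a chunk termination and call `terminate_chunk`.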
As observed in
By contrast, processing cores that support transactions permit speculative execution well beyond the type of speculative execution discussed above (although the cores of
In an implementation, the execution pipelines 206_1 through 206_N of the processor have enhanced functional units to support instructions (e.g., XACQUIRE and XRELEASE) that permit a software thread to believe it has locked a database as described above. That is, the XACQUIRE instruction when executed announces the beginning of speculative execution and the acquisition of a lock on a database. The XRELEASE instruction when executed announces the end of speculative execution and the release of the lock on a database. Importantly, in an implementation, the underlying hardware of the processor 200 acts more to let the software thread believe it has placed a lock on the database when, in fact, it has technically not locked the entire database, but rather, caused conflict detect hardware 221 within the processor to look for and enforce serial operation between competing threads for a same data item.
Here, it should be clear that permitting a first software thread to lock an entire database can hurt performance if there exists another parallel thread that would like to use the same database. The second thread would have no choice but to wait until the first thread commits its data to the database and releases the lock. In effect, actually locking an entire database would cause two concurrent threads that use the same database to execute serially rather than in parallel.
As such, the XACQUIRE instruction has the effect of “turning on” conflict detect hardware 221 within the processor that understands the database (e.g., system memory or a specific portion thereof) is supposed to “behave as if locked”. This means the conflict detect hardware 221 will permit another process to access the database so long as the access does not compete with the accesses made by the process that executed the XACQUIRE instruction and believes it has acquired a lock (here, a competing access is understood to mean a same memory address). If competing accesses are detected, the thread is “aborted” which causes the transaction's state to flush and the program to return to the XACQUIRE instruction to restart another attempt for the transaction. Here, the conflict detection circuitry 221 detects when another process has attempted to access a same memory location as a transaction that has executed XACQUIRE and is executing within a speculative region of code.
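The “behave as if locked” semantics described above can be sketched in software. The following is a simplified model, not the processor's mechanism: the names `SpeculativeRegion` and `run_transaction` are hypothetical, speculative writes are held in a private buffer, any foreign access to a touched address is treated as competing (as the text defines it), and the interference schedule is supplied by the caller purely for illustration.

```python
class SpeculativeRegion:
    """Toy model of a thread executing between XACQUIRE and XRELEASE."""

    def __init__(self, memory):
        self.memory = memory   # shared "database": address -> value
        self.buffer = {}       # speculative writes, invisible until commit
        self.touched = set()   # addresses accessed inside the region
        self.aborted = False

    def read(self, addr):
        self.touched.add(addr)
        return self.buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.touched.add(addr)
        self.buffer[addr] = value

    def observe_foreign_access(self, addr):
        """Conflict detection: another process accessed addr."""
        if addr in self.touched:
            self.aborted = True
            self.buffer.clear()  # flush the transaction's speculative state

    def commit(self):
        """End of the region: publish buffered writes unless aborted."""
        if self.aborted:
            return False
        self.memory.update(self.buffer)
        return True


def run_transaction(memory, body, interference=(), max_attempts=3):
    """Re-run body until it commits; return the attempt number, or None.

    interference[i] lists the foreign accesses seen during attempt i.
    """
    for attempt in range(max_attempts):
        region = SpeculativeRegion(memory)
        body(region)
        if attempt < len(interference):
            for addr in interference[attempt]:
                region.observe_foreign_access(addr)
        if region.commit():
            return attempt + 1
    return None
```

A foreign access to an unrelated address lets the first attempt commit, whereas a foreign access to a touched address discards the buffered writes and restarts the body, mirroring the return to the XACQUIRE instruction.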
In another implementation, the processor also supports additional instructions that permit more advanced transactional semantics (e.g., XBEGIN, XEND and XABORT). XBEGIN and XEND act essentially the same as XACQUIRE and XRELEASE, respectively. Here, XBEGIN announces the beginning of speculative execution (turns on conflict detection circuitry 221) and XEND announces the end of speculative execution (turns off conflict detection circuitry 221). Operation is as discussed above except that a transaction abort leaves an error code in control register space 222 (e.g., EAX model specific register space implemented with one or more register circuits) of the core that executed the aborted thread, providing more details about the abort (e.g., abort caused by XABORT instruction, transaction may succeed on retry, conflict caused abort, internal buffer overflowed, debug breakpoint was hit, abort occurred during nested transaction).
The information left in the register space 222 can be used to direct program flow after an abort to something other than an automatic retry of the transaction. Additionally, the processor may support an instruction (e.g., XABORT) that explicitly aborts the transaction. The XABORT instruction gives the programmer the ability to define transactional abort conditions other than those explicitly designed into the processor hardware. In the case of XABORT, the EAX register will contain information provided by the XABORT instruction (e.g., describing the event that caused its execution).
Processors providing transactional support add to the complexity of debugging multi-threaded program code. As such, the improved processor 200 of
Additionally, the new hooks 230 will report the existence of an aborted transaction. In response the logging circuitry 201 will terminate a chunk, create a packet that describes the chunk termination and write the packet out to system memory 203. Notably, in this approach, the detection of an abort for logging purposes rides off the conflict detection circuitry 221 within the processing cores 205 that actually detects conflicts for aborting transactions rather than on the memory race detection circuitry 207. The relationship between the conflict detection circuitry 221 and the memory race detection circuitry 207 is discussed in more detail below. In an implementation where the processor includes register space 222 that contains additional information describing an abort (e.g., the aforementioned EAX register space), the additional hooks 230 are further designed to report the information contained in the register space 222 to the logging circuitry 201. In processors that support an instruction that explicitly terminates a transaction (e.g., XABORT), a transaction abort packet will also be created and reported out (e.g., with EAX register content if available).
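The reporting flow described above (a hook reports an event, the logging circuitry terminates the chunk, builds a packet, and writes it out) can be modeled in a few lines. This is a hypothetical sketch: the packet fields, the per-chunk instruction count, and the names `Event`, `Packet` and `ChunkLogger` are assumptions made for illustration and are not taken from the packet structure of the figures; a Python list stands in for the region of system memory 203.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Event(Enum):
    """Chunk termination reasons used in this sketch (illustrative subset)."""
    TX_BEGIN = "transaction start"
    TX_END = "transaction end"
    TX_ABORT = "transaction abort"
    MEMORY_RACE = "memory race"


@dataclass
class Packet:
    thread_id: int
    reason: Event
    instructions_in_chunk: int     # assumed field, for illustration only
    status: Optional[int] = None   # e.g., abort register contents, if available


@dataclass
class ChunkLogger:
    """Toy model of one logging circuitry instance."""
    thread_id: int
    log: List[Packet] = field(default_factory=list)  # stands in for system memory
    count: int = 0                                   # instructions in the open chunk

    def retire_instruction(self) -> None:
        self.count += 1

    def report(self, reason: Event, status: Optional[int] = None) -> Packet:
        """Called by a hook: terminate the chunk and write out a packet."""
        packet = Packet(self.thread_id, reason, self.count, status)
        self.log.append(packet)
        self.count = 0  # a new chunk begins
        return packet
```

For an abort, the hook would pass the abort register contents as `status`, so the packet carries the same detail that software would see in the register space.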
An additional improvement over and above the packet structure of
In an implementation, the TSW contains the contents of the (e.g., EAX) control register in the case of a transaction abort, or, the contents of a “transaction nested counter” register (not depicted) in the case of a transaction start or transaction end. In the case of a transaction abort, in an embodiment, the contents of the EAX control register indicate: 1) if the abort is from an XABORT instruction; 2) whether the transaction may succeed on retry; 3) if the abort is from a conflict; 4) if the abort is from an overflow; 5) if the abort is from a debug breakpoint; 6) whether the aborted transaction is nested. For nested transactions, the processor is designed to support a string of transactions within a transaction (e.g., a first transaction can start another transaction and so on). The transaction nested counter value within its reserved register space essentially keeps track of which inner transaction (if any) the current transaction pertains to.
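The six abort indications listed above can be decoded from a register value as sketched below. The text does not give bit positions, so the layout here is an assumption borrowed from the publicly documented abort status encoding for XBEGIN on x86 processors (bits 0 through 5 for the six flags in the order listed, bits 31:24 for the value supplied by XABORT); the names `AbortStatus` and `decode_abort_status` are hypothetical.

```python
from typing import NamedTuple


class AbortStatus(NamedTuple):
    explicit: bool     # 1) abort came from an XABORT instruction
    may_retry: bool    # 2) transaction may succeed on retry
    conflict: bool     # 3) abort caused by a conflict
    overflow: bool     # 4) abort caused by an overflow
    debug: bool        # 5) abort caused by a debug breakpoint
    nested: bool       # 6) aborted transaction was nested
    xabort_code: int   # value supplied by XABORT; meaningful only if explicit


def decode_abort_status(eax: int) -> AbortStatus:
    """Split a 32-bit abort status word into its fields (assumed layout)."""
    flags = [bool((eax >> bit) & 1) for bit in range(6)]
    return AbortStatus(*flags, xabort_code=(eax >> 24) & 0xFF)
```

A log analysis tool could apply such a decoder to the TSW of each abort packet, for example to separate conflict aborts that may succeed on retry from explicit aborts carrying a programmer-defined code.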
In an implementation, the memory race detection circuitry 207 (part of the prior art logging technology of
Also, the TSW information of a chunk termination packet can include information pertaining to an abort as to whether or not the memory race detection circuitry 207 detected any conflicts. If not, it is suggestive that the conflict detection circuitry 221 that aborted the transaction actually experienced a “false positive” conflict. In an implementation, false positives are possible at the conflict detection circuitry 221 because of the fact that caches (such as an L1 cache) use hashing circuits to determine where a cached item of data is to be stored and, typically, multiple different memory addresses can hash to a same caching storage location. In a further implementation, the memory race detection circuitry 207 is also capable of generating false positives for similar reasons—although the hashing and storage of memory addresses can be different in the memory race detection circuitry (e.g., a bloom filter is used to keep the read and write sets and memory addresses are hashed to a specific bloom filter location) than in the caching circuitry where the transaction conflict detection circuitry 221 resides. As such, in this case, if the memory race detection circuitry reports any conflicts they cannot be completely relied upon for detecting transactional aborts.
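The way hashing produces false positives can be seen in a minimal bloom filter model of a read or write set. This is a sketch under stated assumptions: the two hash functions (low-order bits and next-higher bits of the address) and the 16-bit filter size are toy choices made so the outcome can be checked by hand, and the class name `BloomSet` is hypothetical.

```python
class BloomSet:
    """Toy bloom filter holding a read set or write set of addresses."""

    def __init__(self, num_bits: int = 16):
        self.num_bits = num_bits
        self.bits = 0  # bit vector stored in a Python int

    def _positions(self, addr: int):
        # Two toy hash functions; real hardware would use different hashing.
        return (addr % self.num_bits, (addr // self.num_bits) % self.num_bits)

    def add(self, addr: int) -> None:
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def might_contain(self, addr: int) -> bool:
        """True if addr may be present; False means definitely absent."""
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))


write_set = BloomSet()
write_set.add(0x12)  # sets bits 2 and 1
write_set.add(0x34)  # sets bits 4 and 3
```

Here `write_set.might_contain(0x14)` is true even though 0x14 was never added, because its two positions (4 and 1) were set by the other two addresses. An address such as 0x56, whose positions (6 and 5) are clear, is correctly reported as absent. The filter never misses an address that was added, which is why a reported conflict is only suggestive while the absence of a report is reliable.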
In a further embodiment, the CTR information of a transaction related chunk termination packet indicates whether the transaction was terminated because of a late lock acquire (LLA). A late lock acquire is a special circumstance that permits a transaction to commit its data even though the transaction has not completed. Typically LLAs are imposed when the transaction needs to be “paused”, e.g., in response to an exception or unsafe instruction so that its state can be externally saved. After the transaction's state is externally saved, the transaction resumes normal operation. In this case, again, hooks within the processing cores report out the occurrence of any LLA event to the logging circuitry 201 which reports out a chunk termination event pertaining to the LLA and its termination of the transaction.
The logging circuitry 201 can be implemented in any number of ways. At a first extreme the logging circuitry 201 can be implemented completely in dedicated, custom logic circuitry. At another extreme the logging circuitry 201 can be implemented as a micro-controller or other form of program code execution circuitry that executes program code (e.g., firmware) to perform its various functions. Other blends between these two extremes are also possible.
As any of the logic processes taught by the discussion above may be performed with a controller, micro-controller or similar component, such processes may be implemented with program code such as machine-executable instructions that cause a machine that executes these instructions to perform certain functions.
It is believed that processes taught by the discussion above may also be described in source level program code in various object-orientated or non-object-orientated computer programming languages. An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or other)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards or other type of machine-readable media suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link (e.g., a network connection)).
The above description describes a processor, including: memory access conflict detection circuitry to identify a conflict pertaining to a transaction being executed by a thread that believes it has locked information within a memory; logging circuitry to construct and report out a packet if the memory access conflict detection circuitry identifies a conflict that causes the transaction to be aborted. In an embodiment the processor includes register space to store information pertaining to the transaction's abort. In an embodiment the packet includes information from the register space. In an embodiment the information indicates that the transaction was aborted because the memory access conflict detection circuitry detected a conflict. In an embodiment the processor comprises a memory race detection circuit to detect memory races, the logging circuitry to construct and report out a packet if the memory race detection circuit detects a memory race. In an embodiment the processor is designed to permit the logging circuitry to be concurrently responsive to both the memory access conflict detection circuitry and the memory race detection circuit. In an embodiment the processor supports an instruction that explicitly aborts a transaction, the logging circuitry to report out a second packet if the instruction is executed. In an embodiment the processor supports an instruction that marks the beginning of a transaction, the logging circuitry to report out a second packet if the instruction is executed. In an embodiment the processor supports an instruction that marks the end of a successfully completed transaction, the logging circuitry to report out a second packet if the instruction is executed.
A method is described that includes: executing an instruction that marks the beginning of a transaction, the instruction being part of a thread that believes it has a lock on information within a memory; constructing and reporting out a logging packet in response to the executing of the instruction; and, constructing and reporting out a second logging packet in response to the transaction having ended. In an embodiment the transaction has successfully completed and the constructing and reporting out of the second packet is responsive to execution of a second instruction that marks successful completion of the transaction. In an embodiment the transaction has been aborted and the constructing and reporting out of the second packet is responsive to execution of a second instruction that explicitly aborted the transaction. In an embodiment the transaction has been aborted because a memory access conflict was detected. In an embodiment the method further includes detecting a memory race while the transaction is executing. In an embodiment the method further includes constructing and reporting out a third logging packet in response to the detection of the memory race.
A computing system is described that includes: a) a processor, the processor comprising: memory access conflict detection circuitry to identify a conflict pertaining to a transaction being executed by a thread that believes it has locked information within a memory; logging circuitry to construct and report out a packet if the memory access conflict detection circuitry identifies a conflict that causes the transaction to be aborted; and, b) a memory controller coupled to the memory. In an embodiment the processor supports an instruction that explicitly aborts a transaction, the logging circuitry to report out a second packet if the instruction is executed. In an embodiment the processor supports an instruction that marks the beginning of a transaction, the logging circuitry to report out a second packet if the instruction is executed. In an embodiment the processor supports an instruction that marks the end of a successfully completed transaction, the logging circuitry to report out a second packet if the instruction is executed. In an embodiment the processor includes register space to store information pertaining to the transaction's abort and the packet includes information from the register space.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Number | Date | Country | |
---|---|---|---|
20150186178 A1 | Jul 2015 | US |