Memory disambiguation is a set of techniques employed by high-performance out-of-order executing microprocessors that execute memory access instructions (loads and stores) out of program order. The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated. They also eliminate spurious memory dependencies and allow for a greater level of instruction-level parallelism by allowing safe out-of-order execution of loads and stores.
When attempting to execute instructions out of order, a microprocessor must respect true dependencies between instructions. For example, consider a simple true dependence:
1: add $1, $2, $3    # R1 <= R2 + R3
2: add $5, $1, $4    # R5 <= R1 + R4 (dependent on 1)
In this example, the add instruction on line 2 is dependent on the add instruction on line 1, because register R1 is a source operand of the addition operation on line 2. The add on line 2 cannot execute until the add on line 1 completes. In this case, the dependence is static and easily determined by the microprocessor, because the sources and destinations are registers. The destination register of the add instruction on line 1 (R1) is part of the instruction encoding, and so can be determined by the microprocessor early on, during the decode stage of the pipeline. Similarly, the source registers of the add instruction on line 2 (R1 and R4) are also encoded into the instruction itself and are determined in decode. To respect this true dependence, the microprocessor's scheduler logic will issue these instructions in the correct order (instruction 1 first, followed by instruction 2) so that the result of 1 is available when instruction 2 needs it.
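The register-readiness rule described above can be sketched in software. The following is a minimal, hypothetical model (the instruction format, register names, and the `issue_order` helper are all invented for illustration, not a description of real scheduler hardware) of a scheduler that only issues an instruction once its source registers are ready:

```python
# Hypothetical instruction format: (dest, src1, src2); registers are strings.

def issue_order(instructions, ready_regs):
    """Greedy scheduler sketch: each 'cycle', issue every instruction whose
    source registers are ready; an instruction's destination register only
    becomes ready after that instruction has issued."""
    ready = set(ready_regs)
    pending = list(enumerate(instructions))
    order = []
    while pending:
        issued = []
        for idx, (dest, src1, src2) in pending:
            if src1 in ready and src2 in ready:
                order.append(idx)
                issued.append((idx, dest))
        if not issued:
            raise RuntimeError("unsatisfiable dependence")
        for idx, dest in issued:
            ready.add(dest)           # result now available to later consumers
        done = {idx for idx, _ in issued}
        pending = [p for p in pending if p[0] not in done]
    return order

# The two add instructions from the example above:
prog = [("R1", "R2", "R3"),   # 1: add R1 <= R2 + R3
        ("R5", "R1", "R4")]   # 2: add R5 <= R1 + R4 (depends on 1)
print(issue_order(prog, {"R2", "R3", "R4"}))   # [0, 1]: instruction 1 first
```

Because R1 is not ready until instruction 1 issues, the scheduler is forced into program order for this pair, exactly as the text describes.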
Complications arise when the dependence is not statically determinable. Such non-static dependencies arise with memory instructions (loads and stores) because the location of the operand cannot be determined from the instruction encoding alone:
1: store $1, 2($2)    # Mem[R2 + 2] <= R1
2: load  $3, 4($4)    # R3 <= Mem[R4 + 4] (possibly dependent on 1, possibly same address as above)
Here, the store instruction writes a value to the memory location specified by the value in (R2 + 2), and the load instruction reads the value at the memory location specified by (R4 + 4). The microprocessor cannot statically determine, prior to execution, whether the memory locations specified in the two instructions are the same or different, because the locations depend on the values in R2 and R4. If the locations are different, the instructions are independent and can be successfully executed out of order. However, if the locations are the same, then the load is dependent on the store to produce its value. This is known as an ambiguous dependence.
Out-of-order execution and memory access operations
Executing loads and stores out of order can produce incorrect results if a dependent load/store pair is executed out of order. Consider the following code snippet, given in MIPS assembly:
1: div $27, $20
2: sw  $27, 0($30)
3: lw  $08, 0($31)
4: sw  $26, 0($30)
5: lw  $09, 0($31)
Assume that the scheduling logic will issue an instruction to the execution unit when all of its register operands are ready. Further assume that registers $30 and $31 are ready: the values in $30 and $31 were computed a long time ago and have not changed. However, assume $27 is not ready: its value is still in the process of being computed by the div (integer divide) instruction. Finally, assume that registers $30 and $31 hold the same value, and thus all the loads and stores in the snippet access the same memory word.
In this situation, the sw $27, 0($30) instruction on line 2 is not ready to execute, but the lw $08, 0($31) instruction on line 3 is ready. If the processor allows the lw instruction to execute before the sw, the load will read an old value from the memory system; however, it should have read the value that was just written there by the sw. The load and the store executed out of program order, but there was a memory dependence between them that was violated.
Similarly, assume that register $26 is ready. The sw $26, 0($30) instruction on line 4 is then ready to execute, and it may execute before the preceding lw $08, 0($31) instruction on line 3. If this occurs, the lw $08, 0($31) instruction would read the wrong value from the memory system, since a later store wrote its value there before the load executed.
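The first violation can be illustrated with a toy memory model. The concrete addresses and data values below are made up purely for illustration; the point is that executing the load on line 3 before the store on line 2 returns a stale value:

```python
# Toy model of the RAW violation: the lw on line 3 executes before the
# sw on line 2. Addresses and values are invented for illustration.

mem = {100: 7}                               # the shared word initially holds 7
r = {"$30": 100, "$31": 100, "$27": None}    # $30 and $31 alias; $27 not ready

# Out-of-order (incorrect): the load executes while the div is still running.
stale = mem[r["$31"]]          # lw $08, 0($31) -> reads the old value, 7

# The div finally produces $27, and then the store executes.
r["$27"] = 42
mem[r["$30"]] = r["$27"]       # sw $27, 0($30)

correct = mem[r["$31"]]        # what the load *should* have read
print(stale, correct)          # 7 42 -> the memory dependence was violated
```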
Characterization of memory dependencies
Memory dependencies come in three flavors:
- Read-After-Write (RAW) dependencies: RAW dependencies arise when a load operation reads a value from memory that was produced by the most recent preceding store to the same address.
- Write-After-Read (WAR) dependencies: WAR dependencies arise when a store operation writes a value to a memory address that a preceding load reads.
- Write-After-Write (WAW) dependencies: WAW dependencies arise when two store operations write values to the same memory address.
The three dependencies are shown in the preceding code segment (reproduced for clarity):
1: div $27, $20
2: sw  $27, 0($30)
3: lw  $08, 0($31)
4: sw  $26, 0($30)
5: lw  $09, 0($31)
The lw $08, 0($31) instruction on line 3 has a RAW dependence on the sw $27, 0($30) instruction on line 2, and the lw $09, 0($31) instruction on line 5 has a RAW dependence on the sw $26, 0($30) instruction on line 4. The stores are the most recent producers to that memory address, and the loads read that memory address's value. The sw $26, 0($30) instruction on line 4 has a WAR dependence on the lw $08, 0($31) instruction on line 3, since it writes the memory address that the preceding load reads from. Finally, the sw $26, 0($30) instruction on line 4 has a WAW dependence on the sw $27, 0($30) instruction on line 2, since both stores write to the same memory address.
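These three dependence types can be enumerated mechanically from a trace of memory operations. The `classify` helper and the operation format below are a hypothetical illustration, not a hardware description; it lists every dependent pair, numbered by position in the trace:

```python
# Sketch: classify memory dependencies between pairs of memory operations,
# given in program order. Each op is (kind, address); pairs are 1-based
# positions within the ops list.

def classify(ops):
    deps = []
    for j, (kind_j, addr_j) in enumerate(ops):
        for i, (kind_i, addr_i) in enumerate(ops[:j]):
            if addr_i != addr_j:
                continue                      # different addresses: independent
            if kind_i == "store" and kind_j == "load":
                deps.append(("RAW", i + 1, j + 1))
            elif kind_i == "load" and kind_j == "store":
                deps.append(("WAR", i + 1, j + 1))
            elif kind_i == "store" and kind_j == "store":
                deps.append(("WAW", i + 1, j + 1))
    return deps

# Memory ops from snippet lines 2-5; $30 == $31, so one shared address "A".
ops = [("store", "A"),  # op 1 = line 2: sw $27, 0($30)
       ("load",  "A"),  # op 2 = line 3: lw $08, 0($31)
       ("store", "A"),  # op 3 = line 4: sw $26, 0($30)
       ("load",  "A")]  # op 4 = line 5: lw $09, 0($31)
deps = classify(ops)
print(deps)   # includes ("RAW", 1, 2), ("WAR", 2, 3), ("WAW", 1, 3), ("RAW", 3, 4)
```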
Memory disambiguation mechanisms
Modern microprocessors use the following mechanisms, implemented in hardware, to resolve ambiguous dependencies and recover when a dependence was violated.
Avoiding WAR and WAW dependencies
Values from store instructions are not committed to the memory system (in modern microprocessors, the CPU cache) when they execute. Instead, the store instructions, including the memory address and store data, are buffered in a store queue until they reach the retirement point. When a store retires, it then writes its value to the memory system. This avoids the WAR and WAW dependence problems shown in the code snippet above, where an earlier load would receive an incorrect value from the memory system because a later store was allowed to execute before the earlier load.
Additionally, buffering stores until retirement is necessary because stores may follow instructions that can raise an exception (such as a load of a bad address, or a divide by zero) or a conditional branch whose direction (taken or not taken) is not yet known. If the exception-raising instruction has not executed or the branch direction was predicted incorrectly, the processor will have fetched and executed instructions on a "wrong path." These instructions should not have been executed at all; the exception condition should have occurred before any of the speculative instructions executed. The processor must "throw away" any results from the bad-path, speculatively-executed instructions when it discovers the exception or branch misprediction. The complication for stores is that any stores on the bad or mispredicted paths must not have committed their values to the memory system.
If stores were allowed to commit their values to memory before it was known whether they should have executed, the commit could not be undone. Thus, without store buffering, stores cannot execute until all possibly-exception-causing instructions have executed (and not caused an exception) and all earlier branch directions are known. Forcing stores to wait in this way greatly reduces out-of-order aggressiveness and limits ILP (instruction-level parallelism) and performance. With store buffering, stores can execute ahead of exception-causing or unresolved branch instructions, buffering their data in the store queue but not committing their values until retirement. This prevents stores on mispredicted or bad paths from committing their values to the memory system while still offering the increased ILP and performance from full out-of-order execution of stores.
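A minimal software sketch of this buffering behavior shows why it makes speculation safe: speculative stores leave memory untouched and can simply be discarded. The `StoreQueue` class and its method names are invented for illustration:

```python
# Sketch of a retirement-time store buffer: stores "execute" into the queue,
# and memory is only updated at retirement, so bad-path stores can be squashed.

class StoreQueue:
    def __init__(self, memory):
        self.memory = memory
        self.queue = []                    # in-flight stores: (addr, value)

    def execute_store(self, addr, value):
        self.queue.append((addr, value))   # buffered, not yet visible in memory

    def retire_store(self):
        addr, value = self.queue.pop(0)    # oldest store reaches retirement
        self.memory[addr] = value          # only now does memory change

    def squash(self):
        self.queue.clear()                 # discard all bad-path stores

mem = {0x40: 1}
sq = StoreQueue(mem)
sq.execute_store(0x40, 99)     # store executes speculatively past a branch
assert mem[0x40] == 1          # memory untouched while the store is in flight
sq.squash()                    # branch was mispredicted: throw the store away
assert mem[0x40] == 1          # memory was never corrupted
```

Had the branch been predicted correctly, `retire_store()` would have committed the value instead.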
Store to load forwarding
Buffering stores until retirement avoids the WAW and WAR dependence problems, but it introduces a new issue. Consider the following scenario: a store executes and buffers its address and data in the store queue. A few instructions later, a load executes that reads from the same memory address to which the store just wrote. If the load reads its data from the memory system, it will read an old value that should have been overwritten by the preceding store. The data obtained by the load will be incorrect.
To solve this problem, processors employ a technique called store-to-load forwarding using the store queue. In addition to buffering stores until retirement, the store queue serves a second purpose: forwarding data from completed but not-yet-retired ("in-flight") stores to later loads. Rather than a simple FIFO queue, the store queue is really a Content-Addressable Memory (CAM) searched using the memory address. When a load executes, it searches the store queue for in-flight stores to the same address that are logically earlier in program order. If a matching store exists, the load obtains its data value from that store instead of the memory system. If there is no matching store, the load accesses the memory system as usual; any preceding, matching stores must have already retired and committed their values. This technique allows loads to obtain correct data when their producer store has completed but not yet retired.
Multiple stores to the load's memory address can be present in the store queue. To handle this case, the store queue is priority encoded to select the latest store that is logically earlier than the load in program order. The determination of which store is "latest" can be achieved by attaching some sort of timestamp to the instructions as they are fetched and decoded, or alternatively by knowing the relative position (slot) of the load with respect to the oldest and youngest stores within the store queue.
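Store-to-load forwarding with this priority encoding can be sketched as follows. Sequence numbers stand in for the timestamps mentioned above; the `load_value` function and the data layout are hypothetical:

```python
# Sketch of store-to-load forwarding: the store queue is searched like a CAM,
# and among in-flight stores to the same address that are earlier in program
# order, the youngest (latest) one forwards its data to the load.

def load_value(load_seq, addr, store_queue, memory):
    """store_queue: list of (seq, addr, data) for in-flight stores."""
    matches = [(seq, data) for seq, a, data in store_queue
               if a == addr and seq < load_seq]   # earlier stores, same address
    if matches:
        return max(matches)[1]   # priority-encode: latest earlier store wins
    return memory[addr]          # no match: matching stores already retired

memory = {0x80: 5}
stores = [(1, 0x80, 10), (3, 0x80, 20), (7, 0x80, 30)]
print(load_value(5, 0x80, stores, memory))   # 20: store seq 3 is the latest
                                             # store earlier than the load
print(load_value(0, 0x80, stores, memory))   # 5: no earlier in-flight store,
                                             # so the memory system is read
```

Note that the store with sequence number 7 is ignored by the first load: it is later in program order, so forwarding from it would violate program semantics.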
RAW dependence violations
Detecting RAW dependence violations
Modern out-of-order CPUs use a number of techniques to detect RAW dependence violations, but all techniques require tracking in-flight loads from execution until retirement. When a load executes, it accesses the memory system and/or store queue to obtain its data value, and then its address and data are buffered in a load queue until retirement. The load queue is similar in structure and function to the store queue, and can be combined with the store queue in a single structure called a load-store queue, or LSQ. The following techniques are used to detect RAW dependence violations:
Load queue CAM search
With this technique, the load queue, like the store queue, is a CAM searched using the memory access address, and it keeps track of all in-flight loads. When a store executes, it searches the load queue for completed loads from the same address that are logically later in program order. If such a matching load exists, it must have executed before the store and thus read an incorrect, old value from the memory system/store queue. Any instructions that used the load's value have also used bad data. To recover when such a violation is detected, the load is marked as "violated" in the retirement buffer. The store remains in the store queue and retires normally, committing its value to the memory system when it retires. However, when the violated load reaches the retirement point, the processor flushes the pipeline and restarts execution from the load instruction. At this point, all previous stores have committed their values to the memory system. The load instruction will now read the correct value from the memory system, and any dependent instructions will re-execute using the correct value.
This technique requires an associative search of the load queue on every store execution, which consumes power and can prove to be a difficult timing path for large load queues. It does not, however, require any additional memory (cache) ports or create resource contention with other loads or stores that are executing.
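The load-queue CAM search can be sketched in a few lines. The data layout and the `store_executes` helper below are invented for illustration; they model only the matching and flagging step, not the subsequent pipeline flush:

```python
# Sketch of the load-queue CAM search: when a store executes, it searches the
# load queue for already-executed, logically later loads to the same address
# and marks them "violated" so the pipeline can flush and restart at the load.

def store_executes(store_seq, store_addr, load_queue):
    """load_queue: dicts with seq, addr, and a violated flag, one per
    in-flight, already-executed load. Returns the seqs of flagged loads."""
    violated = []
    for load in load_queue:
        if load["addr"] == store_addr and load["seq"] > store_seq:
            load["violated"] = True     # this load ran too early: stale data
            violated.append(load["seq"])
    return violated

loads = [{"seq": 4, "addr": 0x10, "violated": False},
         {"seq": 6, "addr": 0x20, "violated": False}]
flagged = store_executes(2, 0x10, loads)
print(flagged)   # [4]: the later load to address 0x10 is flagged; the load
                 # to 0x20 is untouched because its address does not match
```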
Disambiguation at retirement
With this technique, load instructions that have executed out-of-order are re-executed when they reach the retirement point. Since the load is now the retiring instruction, it has no dependencies on any instruction still in-flight; all stores ahead of it have committed their values to the memory system, and so any value read from the memory system is guaranteed to be correct. The value read from memory at re-execution time is compared to the value obtained when the load first executed. If the values are the same, the original value was correct and no violation has occurred. If the re-execution value differs from the original value, a violation has occurred and the pipeline must be flushed.
This technique is conceptually simpler than the load-queue search, and it eliminates the second CAM and its power-hungry search. Since the load must re-access the memory system just before retirement, the access must be very fast, so this scheme relies on a fast cache. No matter how fast the cache is, however, the second memory system access for every out-of-order load instruction does increase instruction retirement latency and increases the total number of cache accesses that must be performed by the processor. The additional retire-time cache access can be satisfied by re-using an existing cache port; however, this creates port resource contention with other loads and stores trying to execute, and thus may decrease performance. Alternatively, an additional cache port can be added just for load disambiguation, but this increases the complexity, power, and area of the cache. Some recent work (Roth 2005) has shown ways to filter many loads from re-executing when it is known that no RAW dependence violation could have occurred; such a technique would reduce or eliminate this latency and resource contention.
A minor benefit of this scheme (compared to a load-queue search) is that it will not flag a RAW dependence violation and trigger a pipeline flush if a store that would otherwise have caused a violation (the store's address matches an in-flight load's address) has a data value that matches the data value already in the cache. In the load-queue-search scheme, an additional data comparison would need to be added to the load-queue search hardware to prevent such a pipeline flush.
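The retirement-time check itself amounts to re-reading memory and comparing values. The minimal sketch below (hypothetical helper, made-up addresses and values) shows both outcomes; note that a store which wrote the same value the load already obtained naturally passes the comparison, which is the minor benefit just described:

```python
# Sketch of disambiguation at retirement: re-read memory when the load
# retires (all earlier stores have committed by then) and compare against
# the value obtained when the load first executed.

def check_load_at_retirement(addr, value_at_execute, memory):
    value_now = memory[addr]                # guaranteed correct at retirement
    return value_now == value_at_execute    # True: no violation; False: flush

memory = {0x8: 42}
print(check_load_at_retirement(0x8, 42, memory))   # True: values match,
                                                   # retire the load normally
print(check_load_at_retirement(0x8, 7, memory))    # False: the load read
                                                   # stale data; flush pipeline
```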
Avoiding RAW dependence violations
CPUs that fully support out-of-order execution of loads and stores must be able to detect RAW dependence violations when they occur. However, many CPUs avoid this problem by forcing all loads and stores to execute in order, or by supporting only a limited form of out-of-order load/store execution. This approach offers lower performance compared to supporting full out-of-order load/store execution, but it can significantly reduce the complexity of the execution core and caches.
The first option, making loads and stores execute in order, avoids RAW violations because there is no possibility of a load executing before its producer store. Another possibility is to effectively break loads into two operations: address generation and cache access. With these two separate but linked operations, the CPU allows a load to access the memory system only once all previous loads and stores have had their addresses generated and buffered in the LSQ. After address generation, there are no longer any ambiguous dependencies, since all addresses are known, and so dependent loads will not execute until their producer stores have completed. This scheme still allows for some "out-of-orderness": the address generation operations for in-flight loads can execute out of order, and loads with no preceding store to the same address can access the memory system ahead of logically earlier instructions.
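The address-generation gating just described can be sketched as a simple predicate. The data layout is hypothetical: `None` marks a store whose address-generation step has not yet executed, so any later load is still ambiguously dependent on it:

```python
# Sketch of splitting loads into address generation and cache access: a load
# may access memory only once every earlier store's address is known and no
# earlier in-flight store to the same address is still pending.

def load_may_access(load_seq, load_addr, stores):
    """stores: list of (seq, addr) for in-flight stores; addr is None until
    that store's address-generation step has executed."""
    for seq, addr in stores:
        if seq < load_seq:
            if addr is None:           # earlier store address still unknown:
                return False           # the dependence is still ambiguous
            if addr == load_addr:      # known conflict: wait for that store
                return False
    return True

print(load_may_access(5, 0xB0, [(1, 0xA0), (3, None)]))   # False: store 3's
                                                          # address is unknown
print(load_may_access(5, 0xB0, [(1, 0xA0), (3, 0xC0)]))   # True: all known,
                                                          # no conflict
print(load_may_access(5, 0xA0, [(1, 0xA0), (3, 0xC0)]))   # False: conflicts
                                                          # with store 1
```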
Memory dependence prediction
Processors that fully support out-of-order load/store execution can use an additional, related technique, called memory dependence prediction, to attempt to predict true dependencies between loads and stores before their addresses are known. Using this technique, the processor can prevent loads with predicted dependencies from executing before their producer stores complete, avoiding the RAW dependence violation and thus avoiding the pipeline flush and the performance penalty it incurs. See the memory dependence prediction article for more details.