In computer architecture, register renaming is a technique that eliminates the false data dependencies arising from the reuse of architectural registers by successive instructions that do not have any real data dependencies between them. The elimination of these false data dependencies reveals more instruction-level parallelism in an instruction stream, which can be exploited by various complementary techniques such as superscalar and out-of-order execution for better performance.
In a register machine, programs are composed of instructions which operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say, add X and Y and put the result in Z. In this instruction, X, Y, and Z are the names of storage locations.
In order to have a compact instruction encoding, most processor instruction sets have a small set of special locations that can be directly named. For example, the x86 instruction set architecture has 8 integer registers, x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors, the names of these locations correspond directly to the elements of a register file.
Different instructions may take different amounts of time; for example, a processor may be able to execute hundreds of instructions while a single load from main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, so the instructions finish out of the original program order. Out-of-order execution has been used in most recent high-performance CPUs to achieve some of their speed gains.
Consider this piece of code running on an out-of-order CPU:
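A sequence of this kind can be written as runnable Python, assuming a dictionary `M` stands in for main memory and the variable `R1` for a single architectural register. The addresses and initial values are illustrative only; the comments carry the instruction numbers.

```python
# M models main memory; addresses and initial values are illustrative.
M = {1024: 10, 2048: 20}

R1 = M[1024]      # 1: load
R1 = R1 + 2       # 2: add
M[1032] = R1      # 3: store
R1 = M[2048]      # 4: load, reusing the name R1
R1 = R1 + 4       # 5: add
M[2056] = R1      # 6: store
```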
Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish 4 until 3 is done, otherwise instruction 3 would write the wrong value. This restriction can be eliminated by changing the names of some of the registers:
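A renamed sequence of this kind might look as follows in Python, where `R2` is a second register name given to the second group. The dictionary `M`, the addresses, and the values are illustrative assumptions; the statements are interleaved only to suggest that the two groups no longer constrain each other.

```python
# M models main memory; addresses and initial values are illustrative.
M = {1024: 10, 2048: 20}

R1 = M[1024]      # 1
R2 = M[2048]      # 4: writes R2, so it can no longer clobber R1
R1 = R1 + 2       # 2
R2 = R2 + 4       # 5
M[1032] = R1      # 3
M[2056] = R2      # 6
```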
Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so that the program can be executed faster.
When possible, the compiler would detect the distinct instructions and try to assign them to a different register. However, there is a finite number of register names that can be used in the assembly code. Many high-performance CPUs have more physical registers than can be named directly in the instruction set, so they rename registers in hardware to achieve additional parallelism.
When more than one instruction references a particular location for an operand, either reading it (as an input) or writing to it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of data hazards:
- Read-after-write (RAW) – a read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
- Write-after-write (WAW) – successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
- Write-after-read (WAR) – a read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is a sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies.
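The three hazard kinds can be told apart mechanically from the registers each instruction reads and writes. The following is a minimal Python sketch under stated assumptions: the `Instr` type and the `hazards` function are hypothetical names introduced for illustration, and only register operands are considered, not memory locations.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]      # registers read

def hazards(first: Instr, second: Instr) -> set:
    """Hazards between two instructions, where `first` is earlier in program order."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")       # second reads what first writes
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")       # both write the same register
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")       # second overwrites what first still has to read
    return found

store = Instr(dest=None, srcs=("R1",))   # a store that reads R1
load = Instr(dest="R1", srcs=())         # a later load that writes R1
print(hazards(store, load))              # {'WAR'}
```

Only the RAW case reflects a real flow of data; the other two arise purely from the reuse of a register name.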
Instead of delaying the write until all reads are completed, two copies of the location can be maintained, the old value and the new value. Reads that precede, in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads that need the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming.
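The idea of keeping an old and a new copy can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular processor: the function `rename` and the `P0`, `P1`, ... names are hypothetical, and the sketch assumes an unlimited supply of physical registers, so nothing is ever discarded or reused.

```python
from itertools import count

def rename(program, arch_regs):
    """Give every register write a fresh physical name.

    program: list of (dest, srcs) pairs; dest is None for instructions
    that write no register. Returns the program in physical names.
    """
    fresh = count()
    # Each architectural register starts out mapped to its own physical register.
    mapping = {r: f"P{next(fresh)}" for r in arch_regs}
    renamed = []
    for dest, srcs in program:
        # Reads use the mapping as it stood before this instruction's own write.
        new_srcs = tuple(mapping[s] for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"P{next(fresh)}"   # new copy; the old one stays readable
            mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("R1", ()),        # 1: load into R1
    ("R1", ("R1",)),   # 2: R1 = R1 + constant
    (None, ("R1",)),   # 3: store R1
    ("R1", ()),        # 4: load into R1 again
    ("R1", ("R1",)),   # 5: R1 = R1 + constant
    (None, ("R1",)),   # 6: store R1
]
for line in rename(program, ["R1"]):
    print(line)
```

After renaming, instruction 3 reads `P2` while instruction 4 writes `P3`, so the write-after-read hazard between them no longer exists.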
Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.
Memory locations can also be renamed, although it is not commonly done to the extent practiced in register renaming. The Transmeta Crusoe processor’s gated store buffer is a form of memory renaming.
If programs refrained from reusing registers immediately, there would be no need for register renaming. Some instruction sets (e.g., IA-64) specify very large numbers of registers specifically for this reason. However, there are limitations to this approach:
- It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code in a process called loop unrolling.
- Large numbers of registers require more bits for specifying a register as an operand in an instruction, resulting in increased code size.
- Many instruction sets historically specified smaller numbers of registers and cannot be changed now.
Code size increases are important because when the program code is larger, the instruction cache misses more often and the processor stalls waiting for new instructions.
Architectural vs physical registers
Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a programmer stops the program in a debugger, they can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.
One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations that can store the results of integer operations, and 72 locations that can store the results of floating-point operations. (In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)
The following text describes two styles of register renaming, which are distinguished by the circuit which holds the data ready for an execution unit.
In all renaming schemes, the machine converts the architectural registers referenced in the instruction stream into tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually a 6 to 8 bit number. The rename file must have a read port for every input of every instruction renamed every cycle, and a write port for every output of every instruction renamed every cycle. Because the size of a register file generally grows as the square of the number of ports, the rename file is usually physically large and consumes significant power.
In the tag-indexed register file style, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are unused.
In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.
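The tag width in the 80-register example follows from simple arithmetic, which can be checked with a short Python sketch (the helper name `tag_bits` is hypothetical):

```python
def tag_bits(physical_registers: int) -> int:
    """Smallest number of bits that can give every physical register its own tag."""
    return (physical_registers - 1).bit_length()

regs = 80
bits = tag_bits(regs)          # 6 bits reach only 64 tags, so 7 are needed
unused = 2 ** bits - regs      # tag values with no register behind them
print(bits, unused)            # 7 48
```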
In the reservation station style, there are many small associative register files, usually one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.
In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.
- Architectural Register File or Retirement Register File (RRF)
- The committed register state of the machine. RAM indexed by logical register number. Typically written into as results are retired or committed out of a reorder buffer.
- Future File
- The most speculative register state of the machine. RAM indexed by logical register number.
- Active Register File
- The Intel P6 group’s term for Future File.
- History Buffer
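- Typically used in combination with a future file. Contains the “old” values of registers that have been overwritten. If the producer is still in flight it may be RAM indexed by history buffer number. After a branch misprediction, results from the history buffer must be used: either they are copied, or the future file lookup is disabled and the history buffer is a content-addressable memory (CAM) indexed by logical register number.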
- Reorder Buffer (ROB)
- A structure that is sequentially (circularly) indexed on a per-operation basis, for instructions in flight. It differs from a history buffer because the reorder buffer typically comes after the future file (if it exists) and before the architectural register file.
Reorder buffers can be data-less or data-ful.
In Willamette’s ROB, the ROB entries point to registers in the physical register file (PRF), and also contain other bookkeeping. This was also the first out-of-order design done by Andy Glew, at Illinois with HaRRM.
In P6’s ROB, the ROB entries contain data; there is no separate PRF. Data values are copied from the ROB to the RRF at retirement.
One small detail: if there is temporal locality in ROB entries (i.e., if instructions close together in the von Neumann instruction sequence write back close together in time), it may be possible to perform write combining on ROB entries and so have fewer ports than a separate ROB/PRF would. It is not clear if it makes a difference, since a PRF should be banked.
ROBs usually do not have associative logic, and certainly none of the ROBs designed by Andy Glew have CAMs. Keith Diefendorff insisted that ROBs have complex associative logic for many years. The first ROB proposal may have had CAMs.
Details: tag-indexed register file
This is the renaming style used in the MIPS R10000, the Alpha 21264, and the FP section of the AMD Athlon.
In the renaming stage, every architectural register referenced (for read or write) is looked up in an architecturally-indexed remap file. This file returns a tag and a ready bit. The tag is non-ready if there is a queued instruction which will write to it that has not yet executed. For read operands, this tag takes the place of the architectural register in the instruction. For every register write, a new tag is pulled from a free tag FIFO, and a new mapping is written into the remap file, so that future instructions reading the architectural register will refer to this new tag. The tag is marked as unready, because the instruction has not yet executed. The previous physical register allocated for that architectural register is saved with the instruction in the reorder buffer, which is a FIFO that holds the instructions in program order between the decode and graduation stages.
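The renaming stage described above can be sketched in Python. This is a minimal model under stated assumptions, not the logic of any real processor: the class `Renamer` and its field names are hypothetical, architectural register i is assumed to start out mapped to tag i with a ready value, and instructions are reduced to one optional destination plus a list of sources.

```python
from collections import deque

class Renamer:
    def __init__(self, arch_regs, num_phys):
        n = len(arch_regs)
        # Remap file: architectural register -> current tag.
        self.remap = {r: t for t, r in enumerate(arch_regs)}
        # One ready bit per tag; the initial mappings hold valid values.
        self.ready = [True] * n + [False] * (num_phys - n)
        self.free = deque(range(n, num_phys))   # free tag FIFO
        self.rob = deque()                      # reorder buffer, program order

    def rename(self, dest, srcs):
        # Read operands: each architectural name becomes (tag, ready bit).
        src_tags = [(self.remap[s], self.ready[self.remap[s]]) for s in srcs]
        new_tag = prev_tag = None
        if dest is not None:
            if not self.free:
                raise RuntimeError("no free tags: decode would have to stall")
            new_tag = self.free.popleft()
            self.ready[new_tag] = False         # result not yet produced
            prev_tag = self.remap[dest]         # kept with the instruction
            self.remap[dest] = new_tag          # later readers see the new tag
        self.rob.append((dest, new_tag, prev_tag))
        return src_tags, new_tag

r = Renamer(["R1", "R2"], num_phys=4)
print(r.rename("R1", ["R2"]))   # ([(1, True)], 2)
print(r.rename("R2", ["R1"]))   # ([(2, False)], 3)
```

In the second call the source `R1` resolves to tag 2, which is not yet ready because the first instruction has not executed.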
The instructions are then placed in various issue queues. As instructions are executed, the tags for their results are broadcast, and the issue queues match these tags against the tags of their non-ready source operands. A match means that the operand is ready. The remap file also matches these tags, so that it can mark the corresponding physical registers as ready. When all the operands of an instruction in an issue queue are ready, that instruction is ready to issue. The issue queues pick ready instructions to send to the various functional units each cycle. Non-ready instructions stay in the issue queues. This unordered removal of instructions from the issue queues can make them large and power-consuming.
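The wakeup-and-select behaviour of an issue queue can be illustrated with a small Python model. The class `IssueQueue` and its method names are hypothetical, and the oldest-first selection policy is an assumption made for the sketch; real designs differ in how they prioritise ready instructions.

```python
class IssueQueue:
    def __init__(self):
        # Entries stay in insertion (age) order.
        self.entries = []

    def insert(self, op, src_tags, ready):
        """Queue `op`; `ready` is the set of tags whose values already exist."""
        waiting = {t for t in src_tags if t not in ready}
        self.entries.append({"op": op, "waiting": waiting})

    def broadcast(self, tag):
        # Wakeup: every entry compares the broadcast tag with its non-ready sources.
        for entry in self.entries:
            entry["waiting"].discard(tag)

    def select(self, width):
        # Pick up to `width` ready entries, oldest first; the rest stay queued.
        picked, kept = [], []
        for entry in self.entries:
            if not entry["waiting"] and len(picked) < width:
                picked.append(entry["op"])
            else:
                kept.append(entry)
        self.entries = kept
        return picked

q = IssueQueue()
q.insert("add", [2, 1], ready={1})   # waits on tag 2
q.insert("load", [1], ready={1})     # all sources ready
print(q.select(2))                   # ['load']
q.broadcast(2)                       # the producer of tag 2 finishes
print(q.select(2))                   # ['add']
```

The younger `load` leaves the queue before the older `add`, which is the unordered removal the text refers to.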
Issued instructions read from a tag-indexed physical register file (bypassing just-broadcast operands) and then execute. Execution results are written to the tag-indexed physical register file, as well as broadcast to the bypass network preceding each functional unit. Graduation puts the previous tag for the written architectural register into the free queue so that it can be reused for a newly decoded instruction.
An exception or branch misprediction causes the remap file to back up to the remap state at the last valid instruction via a combination of state snapshots and cycling through the previous tags in the in-order pre-graduation queue. Since this mechanism is required, and since it can recover any remap state (not just the state before the instruction currently being graduated), branch mispredictions can be handled before the branch reaches graduation, potentially hiding the branch misprediction latency.
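Graduation and recovery both rely on the previous tag saved with each instruction. The Python sketch below illustrates only the walk-back part of recovery, not state snapshots; the functions `graduate` and `roll_back` and the `(dest, new_tag, prev_tag)` entry layout are assumptions made for illustration.

```python
from collections import deque

def graduate(rob, free):
    """Retire the oldest instruction; its previous tag becomes reusable."""
    dest, new_tag, prev_tag = rob.popleft()
    if prev_tag is not None:
        free.append(prev_tag)

def roll_back(rob, remap, free, keep):
    """Undo renames youngest-first until only `keep` entries remain in flight."""
    while len(rob) > keep:
        dest, new_tag, prev_tag = rob.pop()
        if dest is not None:
            remap[dest] = prev_tag      # restore the older mapping
            free.appendleft(new_tag)    # the squashed tag is free again

# Two in-flight writes to R1: tag 0 -> tag 2 -> tag 3.
remap = {"R1": 3}
free = deque()
rob = deque([("R1", 2, 0), ("R1", 3, 2)])

roll_back(rob, remap, free, keep=1)     # squash the younger write
print(remap, list(free))                # {'R1': 2} [3]
graduate(rob, free)                     # the older write retires
print(list(free))                       # [3, 0]
```

Because every entry records the mapping it displaced, the walk can stop at any instruction, not only at the one currently graduating.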
Details: reservation stations
This is the style used in the integer section of the AMD K7 and K8 designs.
In the renaming stage, every architectural register referenced for reads is looked up in both the architecturally-indexed future file and the rename file. The future file read gives the value of that register, if there is no outstanding instruction yet to write to it (i.e., it’s ready). When the instruction is placed in an issue queue, the values read from the future file are written into the corresponding entries in the reservation stations. Register writes in the instruction cause a new, non-ready tag to be written into the rename file. The tag number is usually serially allocated in instruction order; no free tag FIFO is necessary.
Just as with the tag-indexed scheme, the issue queues wait for non-ready operands to see matching tag broadcasts. Unlike the tag-indexed scheme, matching tags cause the corresponding broadcast value to be written into the issue queue entry’s reservation station.
Issued instructions read their arguments from the reservation station, bypass just-broadcast operands, and then execute. As mentioned earlier, the reservation station register files are usually small, with perhaps eight entries.
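Execution results are written to the reorder buffer, to the reservation stations (if the issue queue entry has a matching tag), and to the future file if this is the last instruction to target that architectural register (in which case the register is marked ready).
Graduation copies the value from the reorder buffer into the architectural register file. The sole use of the architectural register file is to recover from exceptions and branch mispredictions.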
Exceptions and branch mispredictions, recognised at graduation, cause the architectural file to be copied to the future file, and all registers marked as ready in the rename file. There is usually no way to reconstruct the state of the future file for some instruction intermediate between decode and graduation, so there is usually no way to do early recovery from branch mispredictions.
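The interplay of the future file, rename file, and architectural file can be sketched in Python. This is a simplified model under stated assumptions: the class `FutureFileMachine` and its method names are hypothetical, reservation stations and the reorder buffer are left out, tag numbers never wrap around, and the future file is updated only by the last outstanding writer of a register.

```python
class FutureFileMachine:
    def __init__(self, arch_regs):
        self.arch = {r: 0 for r in arch_regs}        # committed state
        self.future = dict(self.arch)                # most speculative state
        # Rename file: tag of the last outstanding writer, or None if ready.
        self.pending = {r: None for r in arch_regs}
        self.next_tag = 0

    def rename_write(self, reg):
        tag = self.next_tag                          # serially allocated tag
        self.next_tag += 1
        self.pending[reg] = tag
        return tag

    def read(self, reg):
        tag = self.pending[reg]
        if tag is None:
            return ("value", self.future[reg])       # ready: value goes to the station
        return ("tag", tag)                          # not ready: wait for a broadcast

    def write_back(self, reg, tag, value):
        # Only the last instruction targeting `reg` updates the future file.
        if self.pending[reg] == tag:
            self.future[reg] = value
            self.pending[reg] = None

    def graduate(self, reg, value):
        self.arch[reg] = value                       # becomes committed state

    def recover(self):
        # Exception or misprediction: discard all speculative state.
        self.future = dict(self.arch)
        for reg in self.pending:
            self.pending[reg] = None

m = FutureFileMachine(["R1"])
t = m.rename_write("R1")
print(m.read("R1"))        # ('tag', 0)
m.write_back("R1", t, 42)
print(m.read("R1"))        # ('value', 42)
m.recover()                # the write never graduated
print(m.read("R1"))        # ('value', 0)
```

Since `recover` can only restore the committed state, the sketch has no way to return to an intermediate point between decode and graduation.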
Comparison between the schemes
In both schemes, instructions are inserted in order into the issue queues, but are removed out of order. If the queues do not collapse empty slots, then they will either have many unused entries, or require some sort of variable priority encoding for when multiple instructions are simultaneously ready to go. Queues that collapse holes have simpler priority encoding, but require simple but large circuitry to advance instructions through the queue.
Reservation stations have better latency from rename to execute, because the rename stage finds the register values directly, rather than finding the physical register number and then using that to find the value. This latency shows up as a component of the branch misprediction latency.
Reservation stations also have better latency from instruction issue to execution, because each local register file is smaller than the large central file of the tag-indexed scheme. Tag generation and exception processing are also simpler in the reservation station scheme, as discussed below.
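The physical register files used by reservation stations usually collapse unused entries in parallel with the issue queue they serve, which makes these register files larger in aggregate, more power-hungry, and more complicated than the simpler register files used in a tag-indexed scheme. Worse yet, every entry in each reservation station can be written by every result bus, so that a reservation-station machine with, e.g., 8 issue queue entries per functional unit will typically have 9 times as many bypass networks as an equivalent tag-indexed machine. Consequently, result forwarding consumes much more power and area than in a tag-indexed design.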
Furthermore, the reservation station scheme has four places (future file, reservation station, reorder buffer, and architectural file) where a result value can be stored, whereas the tag-indexed scheme has just one (the physical register file). Because the results from the functional units, broadcast to all these storage locations, must reach a much larger number of locations in the machine than in the tag-indexed scheme, this function consumes more power, area, and time. Still, in machines equipped with very accurate branch prediction schemes, and if execute latencies are a major concern, reservation stations can work remarkably well.
The IBM System/360 Model 91 was an early machine that supported out-of-order execution of instructions; it used the Tomasulo algorithm, which uses register renaming.
The POWER1, released in 1990, was the first microprocessor to use register renaming and out-of-order execution.
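The original R10000 design had neither collapsing issue queues nor variable priority encoding, and suffered starvation problems as a result: the oldest instruction in the queue would sometimes not be issued until both instruction decode had stopped completely for lack of rename registers and every other instruction had been issued. Later revisions of the design starting with the R12000 used a partially variable priority encoder to mitigate this problem.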
Early out-of-order machines did not separate the renaming and ROB/PRF storage functions. For that matter, some of the earliest, such as Sohi’s RUU or the Metaflow DCAF, combined scheduling, renaming, and storage all in the same structure.
Most modern machines do renaming by RAM indexing a map table with the logical register number. E.g., P6 did this; future files do this, and have data storage in the same structure.
However, earlier machines used content-addressable memory (a type of hardware that provides the functionality of an associative array) in the renamer. E.g., the HPSM RAT, or Register Alias Table, essentially used a CAM on the logical register number in combination with different versions of the register.
In many ways, the story of out-of-order microarchitecture has been how these CAMs have been progressively eliminated. Small CAMs are useful; large CAMs are impractical. [citation needed]
The P6 microarchitecture was the first microarchitecture by Intel to implement both out-of-order execution and register renaming. The P6 microarchitecture was used in Pentium Pro, Pentium II, Pentium III, Pentium M, Core, and Core 2 microprocessors. The Cyrix M1, released on October 2, 1995, was the first x86 processor to use register renaming and out-of-order execution. Other x86 processors (such as NexGen Nx686 and AMD K5) released in 1996 also featured register renaming and out-of-order execution of RISC μ-operations (rather than native x86 instructions).