A register file is an array of processor registers in a central processing unit (CPU). Modern integrated-circuit-based register files are usually implemented as fast static RAMs (SRAMs) with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiported SRAMs usually read and write through the same ports.
The instruction set architecture of a CPU almost always defines a set of registers that are used to stage data between memory and the functional units on the chip. In simpler CPUs, these architectural registers correspond one-for-one to the entries in a physical register file (PRF) within the CPU. More complicated CPUs use register renaming, so that the mapping of which physical entry stores a particular architectural register changes dynamically during execution. The register file is part of the architecture and visible to the programmer, as opposed to caches, which are transparent.
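The difference between a one-for-one mapping and a renamed mapping can be illustrated with a small model. This is only a sketch under stated assumptions: the sizes `NUM_ARCH` and `NUM_PHYS`, the class `RenameMap`, and its methods are hypothetical names chosen for illustration and do not describe any particular CPU.

```python
# Hypothetical model of register renaming; sizes and method names are
# illustrative assumptions, not taken from any real CPU.
NUM_ARCH = 4   # architectural registers visible to the programmer
NUM_PHYS = 8   # entries in the physical register file (PRF)


class RenameMap:
    def __init__(self):
        # Start with architectural register i held in physical entry i,
        # which is the one-for-one mapping of a simple CPU.
        self.table = list(range(NUM_ARCH))
        # The remaining physical entries are free for allocation.
        self.free = list(range(NUM_ARCH, NUM_PHYS))

    def lookup(self, arch):
        """Return the physical entry currently holding register `arch`."""
        return self.table[arch]

    def rename_dest(self, arch):
        """Map `arch` to a fresh physical entry for a new write.

        Returns (new_entry, old_entry); the old entry can go back on the
        free list once no earlier instruction still needs it.
        """
        new = self.free.pop(0)
        old = self.table[arch]
        self.table[arch] = new
        return new, old


rename_map = RenameMap()
new_entry, old_entry = rename_map.rename_dest(1)  # r1 moves to a new entry
```

After the call, reads of architectural register 1 are directed to the newly allocated physical entry, while the other architectural registers keep their original entries.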
Register bank switching
Register files may be grouped together as register banks. Some processors have several register banks.
ARM processors use banked registers for fast interrupt requests (FIQ). x86 processors use context switching and fast switching between instruction, decoder, GPRs and register files, if there is more than one, before the instruction is issued, but this exists only on processors that support superscalar execution. Context switching, however, is an entirely different mechanism from ARM's register banking.
The MODCOMP and the later 8051-compatible processors use bits in the program status word to select the currently active register bank.
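As an illustration of bank selection through a status word, the following sketch models the 8051 case, where the two bank-select bits RS0 and RS1 sit in bits 3 and 4 of the PSW and each bank holds eight registers in the low internal RAM. The function names are hypothetical and chosen only for this example.

```python
# Sketch of 8051-style register bank selection. The function names are
# illustrative; the bit positions follow the 8051 PSW layout.
RS_SHIFT = 3          # RS0 is PSW bit 3, RS1 is PSW bit 4
RS_MASK = 0b11        # two bits select one of four banks
REGS_PER_BANK = 8     # R0..R7


def active_bank(psw):
    """Extract the bank number (0..3) from the program status word."""
    return (psw >> RS_SHIFT) & RS_MASK


def register_address(psw, n):
    """Internal RAM address of register Rn under the current PSW."""
    if not 0 <= n < REGS_PER_BANK:
        raise ValueError("register number must be in 0..7")
    return active_bank(psw) * REGS_PER_BANK + n
```

Switching banks therefore costs a single write to the status word rather than a save and restore of eight registers, which is why the scheme is attractive for interrupt handlers.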
The usual layout convention is that a simple array is read out vertically. That is, a single word line, which runs horizontally, causes a row of bit cells to put their data on bit lines, which run vertically. Sense amplifiers, which convert low-swing read bit lines into full-swing logic levels, are usually at the bottom (by convention). Larger register files are then sometimes constructed by tiling mirrored and rotated simple arrays.
Register files have one word line per entry per port, one bit line per bit of width per read port, and two bit lines per bit of width per write port. Each bit cell also has a Vdd and a Vss connection. Therefore, the wire pitch area increases as the square of the number of ports, while the transistor area increases linearly. At some point, it may be smaller and/or faster to have multiple redundant register files, each with fewer read ports, rather than a single register file with all the read ports. The MIPS R8000's integer unit, for example, had a 9-read, 4-write-port, 32-entry, 64-bit register file implemented in a 0.7 µm process, which could be seen when looking at the chip from arm's length.
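The wire counts above can be turned into a rough scaling estimate. The sketch below assumes a wire-limited cell whose height is set by its word lines and whose width is set by its bit lines, and it ignores the Vdd and Vss wires; the function names are illustrative only.

```python
# Rough wire-count model for one bit cell, following the counts in the
# text. Assumes a wire-limited cell and ignores Vdd/Vss; names are
# illustrative.
def wires_per_cell(read_ports, write_ports):
    """Return (word_lines, bit_lines) crossing a single bit cell."""
    word_lines = read_ports + write_ports      # one per port, horizontal
    bit_lines = read_ports + 2 * write_ports   # one per read, two per write
    return word_lines, bit_lines


def relative_wire_area(read_ports, write_ports):
    """Cell area in units of (wire pitch)^2 under the assumptions above."""
    word_lines, bit_lines = wires_per_cell(read_ports, write_ports)
    return word_lines * bit_lines


# Doubling both port counts roughly quadruples the wire-limited area.
ratio = relative_wire_area(2, 2) / relative_wire_area(1, 1)
```

For the 9-read, 4-write example, `wires_per_cell(9, 4)` gives 13 word lines and 17 bit lines per cell, which suggests why splitting the read ports across redundant copies can pay off.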
Two popular approaches to dividing registers into multiple register files are the distributed register file configuration and the partitioned register file configuration.
In principle, any operation that could be done with a 64-bit-wide register file with many read and write ports could be done with a single 8-bit-wide register file with a single read port and a single write port. However, the bit-level parallelism of wide register files with many ports allows them to run much faster, and thus they can do operations in a single cycle that would take many cycles with fewer ports, a narrower bit width, or both.
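A simple cycle count makes the trade-off concrete. The model below is a deliberately crude assumption: each port moves one slice of the register file's width per cycle, reads and writes overlap perfectly, and an operation has two sources and one destination by default. The function name and parameters are hypothetical.

```python
from math import ceil


# Crude model: each port transfers one file-width slice per cycle, and
# reads and writes are assumed to overlap perfectly.
def access_cycles(op_width, file_width, read_ports, write_ports,
                  sources=2, dests=1):
    """Cycles of register file access needed for one operation."""
    slices = ceil(op_width / file_width)        # pieces per operand
    read_cycles = ceil(sources * slices / read_ports)
    write_cycles = ceil(dests * slices / write_ports)
    return max(read_cycles, write_cycles)


wide = access_cycles(64, 64, read_ports=2, write_ports=1)
narrow = access_cycles(64, 8, read_ports=1, write_ports=1)
```

Under these assumptions the wide, two-read-port file serves a 64-bit two-operand operation in one cycle, while the 8-bit single-port file needs sixteen cycles just to read the operands.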
The width in bits of the register file is usually the number of bits in the processor word size. Occasionally it is slightly wider in order to attach "extra" bits to each register, such as the poison bit. If the width of the data word is different from the width of an address (or in some cases, such as the 68000, even when they are the same width), the address registers are in a separate register file from the data registers.
- The decoder is often broken into pre-decoder and decoder proper.
- The decoder is a series of AND gates that drive word lines.
- There is one decoder per read or write port. If the array has four read and two write ports, for example, it has six word lines per bit cell in the array, and six AND gates per row in the decoder. Note that the decoder has to be pitch-matched to the array, which forces those AND gates to be wide and short.
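The split into pre-decoder and decoder proper can be sketched in software as follows. This is a behavioral illustration only, assuming a 4-bit address pre-decoded in 2-bit groups; the function names and the `enable` input are hypothetical, and one such decoder would be instantiated per port.

```python
# Behavioral sketch of one port's decoder. Assumes the address width is
# a multiple of the group width; names are illustrative.
def predecode(addr, addr_bits=4, group_bits=2):
    """One-hot encode each group of address bits (the pre-decoder)."""
    groups = []
    for shift in range(0, addr_bits, group_bits):
        field = (addr >> shift) & ((1 << group_bits) - 1)
        groups.append([int(field == v) for v in range(1 << group_bits)])
    return groups


def word_lines(addr, enable=1, addr_bits=4, group_bits=2):
    """Drive one word line per row by ANDing pre-decoded signals."""
    groups = predecode(addr, addr_bits, group_bits)
    lines = []
    for row in range(1 << addr_bits):
        line = enable
        for i, group in enumerate(groups):
            field = (row >> (i * group_bits)) & ((1 << group_bits) - 1)
            line &= group[field]   # one input of the row's AND gate
        lines.append(line)
    return lines
```

Pre-decoding keeps the fan-in of each row's AND gate small (one input per group plus the enable), which helps when the gate must fit within the row pitch of the array.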
The basic scheme for a bit cell:
- State is stored in a pair of cross-coupled inverters.
- Data is read out through an NMOS transistor onto a bit line.
- Data is written by shorting one side or the other to ground through a two-NMOS stack.
- So: read ports take one transistor per bit cell, write ports take four.
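The transistor budget of the basic cell can be tallied directly from the list above. The sketch assumes that the storage pair is built from two CMOS inverters of two transistors each, which is an assumption not stated in the text; real cells often add more devices, for example to buffer the read path.

```python
# Transistor count for the basic bit cell described in the text.
# STORAGE_TRANSISTORS assumes two 2-transistor CMOS inverters.
STORAGE_TRANSISTORS = 4
PER_READ_PORT = 1      # one NMOS pass transistor per read port
PER_WRITE_PORT = 4     # two 2-NMOS stacks, one per side of the cell


def transistors_per_cell(read_ports, write_ports):
    return (STORAGE_TRANSISTORS
            + PER_READ_PORT * read_ports
            + PER_WRITE_PORT * write_ports)


def transistors_in_array(entries, width, read_ports, write_ports):
    """Array transistors only; decoders and sense amplifiers excluded."""
    return entries * width * transistors_per_cell(read_ports, write_ports)
```

The count grows linearly with the number of ports, in contrast to the wire-limited area, which grows roughly with the square of the port count.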
Many optimizations are possible:
- Sharing lines between cells, for example, Vdd and Vss.
- Read bit lines are often precharged to something between Vdd and Vss.
- Read bit lines often swing only a fraction of the way to Vdd or Vss. A sense amplifier converts this small-swing signal into a full logic level. Small-swing signals are faster because the bit line has little drive but a great deal of parasitic capacitance.
- Write bit lines may be braided, so that they couple equally to the nearby read bit lines. Because write bit lines are full swing, they can cause significant disturbances on read bit lines.
- If Vdd is a horizontal line, it can be switched off, by yet another decoder, if any of the write ports are writing that line during that cycle. This optimization increases the speed of the write.
- Techniques that reduce the energy used by register files are useful in low-power electronics.
Most register files make no special provision to prevent multiple write ports from writing the same entry simultaneously. Instead, the instruction scheduling hardware ensures that only one instruction in any particular cycle writes a particular entry. If multiple instructions targeting the same register are issued, all but one have their write enables turned off.
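The cross-coupled inverters take some finite time to settle after a write operation, during which a read operation will either take longer or return garbage. It is common to have bypass multiplexers that bypass written data to the read ports when a simultaneous read and write to the same entry is commanded. These bypass multiplexers are often part of a larger bypass network that forwards results that have not yet been committed between functional units.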
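A cycle-level model can show how write arbitration and bypassing interact. The sketch below is a behavioral assumption, not a description of any real design: the class and method names are hypothetical, and it assumes that when several writes target one entry in a cycle, the last one in program order is the write that stays enabled.

```python
# Behavioral sketch of a multiported register file with write
# arbitration and read bypass. Names and the "last write wins" policy
# are assumptions made for illustration.
class RegisterFile:
    def __init__(self, entries=32):
        self.cells = [0] * entries

    def cycle(self, reads, writes):
        """Perform one cycle.

        reads:  list of entry indices to read
        writes: list of (index, value) pairs in program order
        Returns the list of values seen on the read ports.
        """
        # Scheduler role: leave only one write enabled per entry.
        enabled = {}
        for index, value in writes:
            enabled[index] = value
        # Bypass multiplexers: a read of an entry being written this
        # cycle gets the write data instead of the unsettled cell.
        results = [enabled.get(index, self.cells[index]) for index in reads]
        # The array is updated at the end of the cycle.
        for index, value in enabled.items():
            self.cells[index] = value
        return results


rf = RegisterFile()
first = rf.cycle(reads=[3], writes=[(3, 7), (3, 9)])
second = rf.cycle(reads=[3, 4], writes=[])
```

In the first cycle the read of entry 3 is served by the bypass path; in the second cycle it is served by the array itself.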
The register file is usually pitch-matched to the datapath that it serves. Pitch matching avoids having the many busses that pass over the datapath turn corners, which would use a lot of area. But since every unit must have the same bit pitch, every unit in the datapath ends up with the bit pitch forced by the widest unit, which can waste area in the other units. Register files, because they have two wires per bit per write port, and because all the bit lines must contact the silicon at every bit cell, can often set the pitch of a datapath.
Area can sometimes be saved, on machines with multiple units in a datapath, by having two datapaths side by side, each of which has a smaller bit pitch than a single datapath would have. This case usually forces multiple copies of a register file, one for each datapath.
The Alpha 21264 (EV6), for instance, was the first large micro-architecture to implement a "Shadow Register File Architecture". It had two copies of the integer register file, and took an extra cycle to propagate data between the two. The issue logic attempted to reduce the number of operations forwarding data between the two copies, which helped reduce the impact of the limited number of GPRs in superscalar and speculative execution. The design was later adopted by SPARC, MIPS and some later x86 implementations.
MIPS uses multiple register files as well. The R8000 floating-point unit had two copies of the floating-point register file, each with four write and four read ports, and wrote both copies at the same time. However, this duplication does not extend to integer operations, and there is still only one integer register file. The shadow register file was later abandoned in newer designs as the architecture moved toward the embedded market.
SPARC uses a "Shadow Register File Architecture" as well. It had up to 4 copies of the integer register file (future, retired, scaled and scratched, each with 7 read and 4 write ports) and 2 copies of the floating-point register file. Unlike Alpha and x86, however, they are located in the back end.
IBM uses the same mechanism as many major microprocessors, closely tying the register file to the decoder, but in a way that differs from the Alpha and x86. Most of its register files serve not only their dedicated decoders but also the thread level. For example, POWER8 has up to 8 instruction decoders, but up to 32 register files of 32 general-purpose registers each (4 read and 4 write ports), to facilitate simultaneous multithreading.
In the x86 processor line, a typical pre-486 CPU did not have an individual register file: the general-purpose registers were directly associated with the decoder, and the x87 push-down stack was located within the floating-point unit itself. Starting with the Pentium, a typical Pentium-compatible x86 processor integrates one copy of a single-port architectural register file containing 8 architectural registers, 8 control registers, 8 debug registers, 8 condition code registers, 8 unnamed based registers, [clarification needed] one instruction pointer, one flag register and 6 segment registers in one file.
There is one copy of the 8-entry x87 FP push-down stack by default; the MMX registers are aliased onto the existing x87 stack. On P6, instructions can independently be stored and executed in parallel in early pipeline stages, before being decoded into micro-operations and renamed for out-of-order execution. Beginning with P6, the register files are placed alongside the retirement and reorder buffers and connected by a 16-byte internal bus. There is still a single x86 register file, which became dual-ported to increase bandwidth for result storage. Registers such as the debug, condition code, control, unnamed and flag registers were placed between the micro-op ROM and the instruction sequencer. Only inaccessible registers such as the segment registers are now separated from the general-purpose register file (except the instruction pointer); they are located between the scheduler and the instruction allocator, in order to facilitate register renaming and out-of-order execution. The x87 stack was later merged with the floating-point register file once the XMM registers appeared in the Pentium III, but the XMM register file is still located separately from the x86 integer register files.
Later P6 implementations (Pentium M, Yonah) introduced a "Shadow Register File Architecture" that expanded to 2 copies of the dual-ported integer architectural register file, with context switching between the future and retired file and the scaled file, using the same trick that was used between integer and floating point. This was done to solve the register bottleneck that appeared in the x86 architecture after micro-op fusion was introduced. Each file still has 8 entries of 32-bit architectural registers, for a total capacity of 32 bytes per file (not counting registers that are inaccessible to programs), and serves as a speculative file. The second file serves as a scaled shadow register file, which cannot be used independently without a context switch. Some SSE2/SSE3/SSSE3 instructions require this feature for integer operations; for example, PSHUFB, PMADDUBSW, PHSUBW, PHSUBD, PHSUBSW, PHADDW, PHADDD and PHADDSW would require loading EAX/EBX/ECX/EDX from both register files. It was, however, uncommon for an x86 processor to use the second register file for the same instruction; most of the time the second file serves as a scaled retired file. The Pentium M architecture still has a single dual-ported FP register file (8 MM/XMM entries) shared by its three decoders, and the FP register file has no shadow register file. In processors after P6, the architectural register file is external and located in the processor's back end after retirement, in contrast to the internal register file, which is located in the inner core for register renaming and the reorder buffer. In Core 2, however, it sits within a unit called the "register alias table" (RAT), located with the instruction allocator but with the same register size as at retirement. Core 2 extended its register file to quad-ported (two read / two write). The file still has 8 entries in 32-bit mode, for a total file size of 32 bytes (not including the 6 segment registers and the instruction pointer, which cannot be accessed in the file by any code or instruction), and up to 16 entries in x64 mode, for a total file size of 128 bytes.
From the Pentium M onward, the number of pipeline ports and decoders increased, but the register files are located with the allocator table instead of the code buffer. The FP XMM register file also became quad-ported (2 read / 2 write).
In later x86 implementations, such as Nehalem and later processors, both integer and floating-point registers are integrated into a unified octa-ported (6 read and 2 write) general-purpose register file (8 + 8 entries in 32-bit mode and 16 + 16 in x64 mode per file), while the number of register files was extended to 2 with an enhanced "Shadow Register File Architecture" in favor of hyper-threading, so that each thread uses an independent register file for its decoder. Sandy Bridge and later designs replaced the shadow register table and architectural registers with a much larger physical register file.
The Atom line is a modern simplified revision of P5. It includes a single copy of the register file, shared with the decoder. The register file is a dual-port design with 8/16 GPR entries, 8/16 debug register entries and 8/16 condition code entries. Unlike the original P5 design, however, it has an eight-entry 64-bit shadow based register file and an eight-entry 64-bit unnamed register file that are separate from the main GPRs and located after the execution unit. The file holding these registers is single-ported and is not exposed to instructions in the way the scaled shadow register file of Core and Core 2 is (a shadow register file is made of architectural registers, and Bonnell does not have a "Shadow Register File Architecture"). The file can, however, be used for renaming purposes, given the lack of out-of-order execution on the Bonnell architecture. Bonnell also has one copy of the XMM floating-point register file per thread. The difference from Nehalem is that Bonnell does not have a unified register file for hyper-threading; instead, it uses a separate rename register for each thread. Larrabee and Xeon Phi are similar to Bonnell in this respect, and the Xeon Phi has up to 128 AVX-512 register files, each containing 32 512-bit ZMM registers for vector instruction storage, which can be as big as the L2 cache.
There are some other x86 lines that do not have a register file in their internal design, such as the Geode GX and Vortex86, along with many embedded processors that are not Pentium-compatible or are reverse-engineered early 80x86 processors. Most of them therefore do not have a register file for their decoders; instead, their GPRs are used individually. The Pentium 4, on the other hand, does not have a register file for its decoder because x86 GPRs do not exist within its structure, owing to the introduction of a physical unified renaming register file (similar to Sandy Bridge, but slightly different due to the inability of the Pentium 4 to use the register before naming) in place of the architectural register file and the x86 decoding scheme. Its SSE2/SSE3/SSSE3 integer operations use the same mechanism.
AMD's early designs, such as the K6, do not have a register file like Intel's and do not support a "Shadow Register File Architecture", as they lack the context switching and bypass inverters that a register file needs in order to function properly. Instead, they use separate GPRs that link directly to a rename register table for the out-of-order CPU, with a dedicated integer decoder and floating-point decoder. The mechanism is similar to Intel's pre-Pentium processor line. For example, the K6 processor has four integer rename register files and two FP rename register files (two eight-entry x87 ST files, one for FADD and one for FMOV) that link directly with its x86 EAX register for integer renaming and its XMM0 register for floating-point renaming. The later Athlon included a "shadow register" in its front end, scaled up to a 40-entry unified register file that includes 16 unnamed GPR entries. Later AMD designs abandoned the shadow register design in favor of the K6 approach of individual, directly linked GPRs. The Phenom, for example, has three integer register files and two SSE register files located in the physical register file and directly linked to the GPRs. This was scaled down to one integer and one floating-point register file on Bulldozer. Like early AMD designs, most other x86 manufacturers, such as Cyrix, VIA, DM&P and SiS, used the same mechanism, which limited the performance of their in-order CPUs. Companies like Cyrix and AMD had to increase cache size to compensate. AMD's SSE integer operations work differently from those of Core 2 and Pentium 4: a separate renaming integer register loads the value directly before the decode stage.
Alpha, SPARC and MIPS allow a register file to load or fetch only one operand at a time, so they require multiple register files to achieve superscalar execution. The ARM processor, on the other hand, does not use multiple register files to load or fetch instructions. ARM GPRs have no special purpose in the instruction set: the ARM ISA does not require accumulator, index, or stack/base pointer registers. Any GPR can propagate and store multiple instructions independently, regardless of their size. The major difference between ARM and other designs is that ARM permits execution on the same general-purpose registers without requiring an additional register file. Although x86 shares with ARM the property that its GPRs can hold any data, x86 runs into data dependencies because it has too few GPRs per file (8 in 32-bit mode and 16 in 64-bit mode, compared to ARM's 13 in 32-bit mode and 31 in 64-bit mode), and it cannot be superscalar without multiple register files (x86 code is big and complex compared to ARM code). As a result, most x86 front ends have become much larger in order to be competitive (e.g. Pentium M and Core 2 Duo, Bay Trail). Some third-party x86-compatible processors even became noncompetitive with ARM because they had no dedicated register file architecture, particularly those from AMD, Cyrix and VIA, which could not deliver significant gains in performance; this left the Intel Atom as the only in-order x86 processor in that competition. This remained the case until integer and floating-point registers were merged into a single file and a large physical register file was introduced in the front end.
Processors that perform register renaming can arrange for each functional unit to write to a subset of the physical register file. This arrangement can eliminate the need for multiple write ports per bit cell, for large savings in area. The resulting register file, effectively a stack of register files with single write ports, then benefits from replication and subsetting the read ports. At the limit, this technique would place a stack of 1-write, 2-read register files at the inputs to each functional unit. Since register files with a small number of ports are often dominated by transistor area, it is best not to push this technique to its limit, but it is useful all the same.
The SPARC ISA defines register windows, in which the 5-bit architectural names of the registers actually point into a window on a much larger register file, with hundreds of entries. Implementing multiported register files with hundreds of entries requires a large area. The register window slides by 16 registers when moved, so that each architectural register name can refer to only a small number of registers in the larger array; e.g. architectural register r20 can only refer to physical registers #20, #36, #52, #68, #84, #100 and #116, if there are just seven windows in the physical file.
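The r20 example follows from simple index arithmetic. The sketch below is a simplification made for illustration: it applies the 16-register stride to every register name and ignores details such as non-windowed global registers and wraparound at the end of the physical file. The function name is hypothetical.

```python
# Simplified register-window indexing: every architectural name is
# offset by 16 entries per window. Globals and wraparound are ignored.
WINDOW_STRIDE = 16
NUM_WINDOWS = 7


def physical_index(arch_reg, window):
    """Physical entry addressed by `arch_reg` in the given window."""
    if not 0 <= window < NUM_WINDOWS:
        raise ValueError("window out of range")
    return arch_reg + WINDOW_STRIDE * window


r20_candidates = [physical_index(20, w) for w in range(NUM_WINDOWS)]
```

Because each name can reach only `NUM_WINDOWS` physical entries, a given architectural register never needs a path to the whole array, which is the property the area-saving implementation exploits.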
To save area, some SPARC implementations implement a 32-entry register file, in which each cell has seven "bits". Only one is readable and writable through the external ports, but the contents of the bits can be rotated. A rotation accomplishes a movement of the register window in a single cycle. Because most of the wires that carry out this state movement are local, tremendous bandwidth is possible with little power.
This same technique is used in the R10000 register renaming mapping file, which stores a 6-bit virtual register number for each of the physical registers. In the renaming file, the renaming state is checkpointed whenever a branch is taken, so that when a branch is detected to be mispredicted, the old renaming state can be recovered in a single cycle. (See Register renaming.)
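The checkpoint-and-recover behavior can be sketched at a functional level. The model below is an assumption made for illustration: it keeps whole-table snapshots in a Python list, whereas the hardware described above keeps the shadow copies inside each cell so that recovery is a local, single-cycle move. The class and method names are hypothetical, and the table layout is simplified.

```python
# Functional sketch of a rename table with branch checkpoints. Hardware
# would hold the shadow copies inside each cell; here they are plain
# list snapshots. All names are illustrative.
class CheckpointedRenameTable:
    def __init__(self, size):
        self.table = list(range(size))
        self.checkpoints = []

    def rename(self, index, value):
        self.table[index] = value

    def checkpoint(self):
        """Save the current renaming state; returns a checkpoint id."""
        self.checkpoints.append(list(self.table))
        return len(self.checkpoints) - 1

    def recover(self, checkpoint_id):
        """Restore the state saved at `checkpoint_id` after a mispredict
        and discard that checkpoint and all younger ones."""
        self.table = list(self.checkpoints[checkpoint_id])
        del self.checkpoints[checkpoint_id:]


table = CheckpointedRenameTable(4)
table.rename(1, 40)
branch = table.checkpoint()      # branch predicted here
table.rename(1, 41)              # speculative renames past the branch
table.rename(2, 42)
table.recover(branch)            # branch turned out to be mispredicted
```

After recovery the table holds exactly the mappings that were live when the branch was checkpointed, and the speculative renames are gone.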
- Sum addressed decoder