Simultaneous multithreading

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to make better use of the resources provided by modern processor architectures.


The name multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.). Although running on the same core, they are completely separated from each other. Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.

Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading (also known as super-threading). In temporal multithreading, only one thread of instructions can be executed in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads is decided by the chip designers. Two concurrent threads per CPU core are common, but some processors support up to eight concurrent threads per core.
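
The difference between the two forms can be illustrated with a toy issue-slot model. The sketch below is only an illustration under simplifying assumptions: the function names, the `ready` table and the issue width of 4 are hypothetical, temporal multithreading is modeled as a plain round-robin choice of one thread per cycle, and instructions that are not issued in a cycle are simply dropped rather than carried over.

```python
from typing import List


def issue_temporal(ready: List[List[int]], width: int) -> int:
    """Temporal multithreading: exactly one thread may issue in each cycle.

    ready[t][c] is the number of instructions thread t could issue in cycle c
    (a stand-in for the instruction-level parallelism available to it).
    """
    cycles = len(ready[0])
    issued = 0
    for c in range(cycles):
        t = c % len(ready)  # round-robin: thread t owns the whole cycle
        issued += min(width, ready[t][c])
    return issued


def issue_smt(ready: List[List[int]], width: int) -> int:
    """SMT: issue slots left unused by one thread are offered to the others."""
    cycles = len(ready[0])
    issued = 0
    for c in range(cycles):
        free = width
        for t in range(len(ready)):
            take = min(free, ready[t][c])
            issued += take
            free -= take
    return issued


if __name__ == "__main__":
    # Hypothetical workload: two threads, four cycles, 4-wide issue.
    ready = [[2, 1, 3, 0],
             [1, 2, 0, 2]]
    print("temporal:", issue_temporal(ready, width=4))
    print("SMT:     ", issue_smt(ready, width=4))
```

In this model SMT issues more instructions over the same four cycles because slots that one thread cannot fill are used by the other, which is the efficiency argument made above.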

Because SMT is really an efficiency technique that inevitably increases conflict on shared resources, measuring or agreeing on its effectiveness can be difficult. However, measurements of the energy efficiency of SMT with parallel native and managed workloads on historical 130 nm to 32 nm Intel SMT (Hyper-Threading) implementations found that in 45 nm and 32 nm implementations, SMT is extremely energy efficient, even with in-order Atom processors [ASPLOS’11]. In modern systems, SMT effectively exploits concurrency with very little additional dynamic power. That is, even when performance gains are minimal, the power consumption savings can be considerable. [citation needed]

Some researchers have shown that the extra threads can be used proactively to seed a shared resource such as a cache, improving the performance of another single thread, and claim that this shows SMT does more than increase efficiency. Others use SMT to provide redundant computation, for some level of error detection and recovery.

However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing the throughput of computations per amount of hardware used.


In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to exploit instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).

Superscalar means executing multiple instructions at the same time, while thread-level parallelism (TLP) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

  • Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grained multithreading or coarse-grained multithreading depending on the frequency of interleaved issues. Fine-grained multithreading, such as in a barrel processor, issues instructions for different threads after every cycle, while coarse-grained multithreading only switches to issue instructions from another thread when the currently executing thread causes a long-latency event. Coarse-grained multithreading is more common because it requires fewer context switches between threads. For example, Intel’s Montecito processor uses coarse-grained multithreading, while Sun’s UltraSPARC T1 uses fine-grained multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only possible way, because such a core can issue at most one instruction per cycle.
  • Simultaneous multithreading (SMT): Multiple instructions from multiple threads are issued in one cycle. The processor must be superscalar to do so.
  • Chip-level multiprocessing (CMP or multicore): integrates two or more processors into one chip, each executing threads independently.
  • Any combination of multithreaded/SMT/CMP.

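The key factor in distinguishing them is how many instructions the processor can issue in one cycle and how many threads those instructions come from. For example, Sun Microsystems’ UltraSPARC T1 (known as “Niagara” until its November 14, 2005 release) is a multicore processor combined with fine-grained multithreading rather than simultaneous multithreading, because each core can issue only one instruction at a time.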
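
The taxonomy above can be written down as a small decision rule based on how many instructions a core issues per cycle and how many threads they may come from. This is a rough sketch, not an authoritative classifier: the function name, its parameters and the example figures are hypothetical and chosen only to mirror the categories in the list.

```python
def classify(cores: int, issue_width: int, threads_per_core: int,
             threads_issuing_per_cycle: int) -> str:
    """Label a design using the categories from the list above.

    issue_width               instructions one core can issue per cycle
    threads_per_core          hardware thread contexts held by one core
    threads_issuing_per_cycle distinct threads that may issue in one cycle
    """
    if min(cores, issue_width, threads_per_core, threads_issuing_per_cycle) < 1:
        raise ValueError("all parameters must be at least 1")
    if threads_issuing_per_cycle > min(issue_width, threads_per_core):
        raise ValueError("cannot issue from more threads than slots or contexts")

    if threads_per_core == 1:
        label = "single-threaded"
    elif threads_issuing_per_cycle > 1:
        # Several threads in one cycle requires a superscalar core.
        label = "SMT"
    else:
        label = "interleaved multithreading"

    # More than one core on the chip adds chip-level multiprocessing.
    return f"CMP + {label}" if cores > 1 else label


if __name__ == "__main__":
    # Illustrative figures only, not taken from any datasheet.
    print(classify(cores=8, issue_width=1, threads_per_core=4,
                   threads_issuing_per_cycle=1))
    print(classify(cores=1, issue_width=3, threads_per_core=2,
                   threads_issuing_per_cycle=2))
```

The first call resembles the multicore, single-issue, fine-grained design described for the UltraSPARC T1, and the second a single superscalar core with two-thread SMT.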

Historical implementations

While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project. [1] The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq, which had in turn acquired DEC. Dean Tullsen’s work was also used to develop the Hyper-Threading (Hyper-Threading Technology or HTT) versions of the Intel Pentium 4 microprocessors, such as the “Northwood” and “Prescott”.

Modern commercial implementations

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and the feature has since been introduced into a number of Intel’s processors. Intel calls the functionality Hyper-Threading, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement [2] compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; however, when running two programs that both require the full attention of the processor, it can actually seem like one or both of the programs slow down slightly when Hyper-Threading is turned on. [3] This is due to the replay system of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs and re-order buffer entries, and equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces the execution time needed for the replay system. This is enough to completely overcome that performance hit. [4]

The latest Imagination Technologies MIPS architecture designs include an SMT system known as “MIPS MT”. [5] MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SoC based on eight cores, each of which runs four threads. The threads can be run in fine-grained mode, where a different thread can be executed each cycle. The threads can also be assigned priorities. Imagination Technologies MIPS CPUs have two SMT threads per core.

IBM’s Blue Gene/Q has 4-way SMT.

The IBM POWER5, announced in May 2004, comes as either a dual-core dual-chip module (DCM), or a quad-core or oct-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM’s implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM’s second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores, each having four Simultaneous Intelligent Threads. This switches the threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput. IBM POWER8 has 8 intelligent simultaneous threads per core (SMT8).
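
The idea of switching the threading mode according to the number of scheduled threads can be sketched as follows. This is a hypothetical heuristic for illustration, not IBM's actual policy: the function name, the default of eight cores and the rule "use the lowest mode that still gives every runnable thread a hardware context" are all assumptions.

```python
from typing import Sequence


def pick_smt_mode(runnable_threads: int, cores: int = 8,
                  modes: Sequence[int] = (1, 2, 4)) -> int:
    """Return the number of hardware threads to enable per core.

    Lower modes favour per-thread response time; higher modes favour
    throughput. The lowest mode that fits all runnable threads is chosen,
    falling back to the highest mode when nothing fits.
    """
    if runnable_threads < 0 or cores < 1:
        raise ValueError("runnable_threads must be >= 0 and cores >= 1")
    for mode in modes:
        if cores * mode >= runnable_threads:
            return mode
    return modes[-1]


if __name__ == "__main__":
    for n in (4, 8, 9, 16, 17, 100):
        print(f"{n:3d} runnable threads -> SMT{pick_smt_mode(n)}")
```

With eight cores, up to 8 runnable threads stay in single-thread mode, 9 to 16 select two threads per core, and anything above that selects four.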

IBM z13 has two threads per core (SMT-2).

Although many people reported that Sun Microsystems’ UltraSPARC T1 (known as “Niagara” until its 14 November 2005 release) and the now defunct processor codenamed “Rock” (originally announced in 2005, but cancelled after many delays in 2009) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as “CMT”, and the overall concept as “Throughput Computing”. The Niagara has eight cores, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems’ Rock processor is different: it has more complex cores that have more than one pipeline.

The Oracle Corporation SPARC T3 has eight fine-grained threads per core; SPARC T4, SPARC T5, SPARC M5, M6 and M7 have eight fine-grained threads per core, of which two can be executed simultaneously.

Fujitsu SPARC64 VI has coarse-grained Vertical Multithreading (VMT); SPARC64 VII and newer have 2-way SMT.

Intel Itanium Montecito used coarse-grained multithreading, and Tukwila and newer use 2-way SMT (with dual-domain multithreading).

Intel Xeon Phi has 4-way SMT (with time-multiplexed multithreading) with hardware-based threads which cannot be disabled, unlike regular Hyper-Threading. [6] The Intel Atom, released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.

In the AMD Bulldozer microarchitecture, the FlexFPU and shared L2 cache are multithreaded, but the integer cores in a module are single-threaded, so it is only a partial SMT implementation. [7] [8]

AMD Zen microarchitecture has 2-way SMT.

VISC architecture [9] [10] [11] uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End, which splits instructions into virtual hardware threadlets which are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single-threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic at a near-single-cycle level (1-4 cycles, depending on the needs of individual applications).


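Depending on the design and architecture of the processor, simultaneous multithreading can decrease performance if any of the shared resources are bottlenecks for performance. [12] Critics argue that it is a considerable burden on software developers to have to test whether simultaneous multithreading is good or bad for their application in various situations, and to insert extra logic to turn it off if it decreases performance. Current operating systems lack convenient API calls for this purpose and for preventing processes with different priorities from taking resources from each other. [13]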
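
As a partial illustration of the kind of extra logic a developer might need, the sketch below discovers which logical CPUs share a physical core, so that a program could choose to avoid co-scheduling two demanding threads on SMT siblings. It assumes a Linux system that exposes `thread_siblings_list` under sysfs; the helper names are hypothetical, and on other systems the lookup simply returns `None`.

```python
from pathlib import Path
from typing import List, Optional


def parse_cpu_list(text: str) -> List[int]:
    """Expand a Linux CPU list such as '0,4' or '0-3,8-11' into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu: int) -> Optional[List[int]]:
    """Return the logical CPUs sharing a core with `cpu`, or None if unknown."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return None  # not Linux, or the topology files are unavailable


if __name__ == "__main__":
    siblings = smt_siblings(0)
    if siblings is None:
        print("SMT topology not available on this system")
    else:
        print("logical CPUs sharing a core with CPU 0:", siblings)
```

A list with more than one entry indicates that CPU 0 has at least one SMT sibling; what to do with that information (for example, pinning threads to distinct cores) is left to the application.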

There is also a security concern with certain simultaneous multithreading implementations. Intel’s Hyper-Threading in NetBurst-based processors has a vulnerability through which it is possible for one application to steal a cryptographic key from another application running on the same processor by monitoring its cache use. [14]

See also

  • Speculative multithreading
  • Symmetric multiprocessing
  • Temporal multithreading
  • Hardware scout


References

  1. Smotherman, Mark (25 May 2011). “End of IBM ACS Project”. School of Computing, Clemson University. Retrieved 19 January 2013.
  2. Marr, Deborah (14 February 2002). “Hyper-Threading Technology Architecture and Microarchitecture” (PDF). Intel Technology Journal. 6 (1): 4. doi:10.1535/itj. Retrieved 25 September 2015.
  3. “CPU performance evaluation Pentium 4 2.8 and 3.0”.
  4. “Replay: Unknown Features of the NetBurst Core”. Retrieved 24 April 2011.
  5. “MIPS MT ASE description”.
  6. Barth, Michaela; Byckling, Mikko; Ilieva, Nevena; Saarinen, Sami; Schliephake, Michael (18 February 2014). Weinberg, Volker, ed. “Best Practice Guide Intel Xeon Phi v1.1”. Partnership for Advanced Computing in Europe.
  7. “AMD Bulldozer Family Module Multithreading”. wccftech. July 2013.
  8. Halfacree, Gareth (28 October 2010). “AMD unveils Flex FP”. bit-tech.
  9. Cutress, Ian (12 February 2016). “Examining Soft Machines’ Architecture: An Element of VISC to Improving IPC”. AnandTech.
  10. (Title missing in source.)
  11. (Title missing in source.)
  12. “Replay: Unknown Features of the NetBurst Core”. Retrieved 24 April 2011.
  13. “How good is hyperthreading?”
  14. “Hyper-Threading Considered Harmful”
