Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading.
The processors, say a number p of them, seem to execute many more than p tasks. This is achieved by each processor having multiple “threads” (or “work-items” or “sequences of SIMD lane operations”), which execute in lock-step and are analogous to SIMD lanes.
The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU); for example, some supercomputers combine CPUs with GPUs.
SIMT was introduced by Nvidia:
Nvidia’s Tesla GPU microarchitecture (first available November 8, 2006 as implemented in the “G80” GPU chip) introduced the single-instruction multiple-thread (SIMT) execution model where multiple independent threads execute concurrently using a single instruction.
ATI Technologies (now AMD) released a competing product on May 14, 2007, the TeraScale 1-based “R600” GPU chip.
As the access time of all widespread RAM types (e.g. DDR SDRAM, GDDR SDRAM, XDR DRAM, etc.) is still relatively high, engineers came up with the idea of hiding the latency that inevitably comes with each memory access. Strictly, the latency hiding is a feature of the zero-overhead scheduling implemented by modern GPUs. It may or may not be considered a property of ‘SIMT’ itself.
SIMT is intended to limit instruction fetching overhead, i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of Nvidia and AMD) in combination with ‘latency hiding’ to enable high-performance execution despite considerable latency in memory-access operations. With latency hiding, the processor is oversubscribed with computation tasks and can quickly switch to another task when the current one would otherwise have to wait on memory. This strategy is comparable to multithreading in CPUs (not to be confused with multi-core).
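The effect of oversubscription can be illustrated with a toy model. The sketch below is not taken from any vendor documentation: it assumes every instruction takes one cycle to issue and is followed by a fixed memory wait, and that switching between groups of threads costs nothing. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(num_groups, latency, instrs_per_group):
    """Fraction of cycles in which the processor issues an instruction.

    Toy model (assumption): each instruction issues in 1 cycle, after which
    its thread group must wait `latency` cycles before it can issue again.
    Switching between groups is free (zero-overhead scheduling).
    """
    ready_at = [0] * num_groups            # cycle at which each group may issue next
    remaining = [instrs_per_group] * num_groups
    total = num_groups * instrs_per_group
    issued = 0
    cycle = 0
    while issued < total:
        # Pick the first group that is ready; if none is ready, the cycle is idle.
        for g in range(num_groups):
            if remaining[g] > 0 and ready_at[g] <= cycle:
                remaining[g] -= 1
                ready_at[g] = cycle + 1 + latency
                issued += 1
                break
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for groups in (1, 4, 10):
        print(groups, round(utilization(groups, latency=9, instrs_per_group=10), 3))
```

In this model the processor stays fully busy once the number of resident groups reaches `latency + 1`; with fewer groups, some cycles are spent idle waiting on memory.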
A downside of SIMT execution is the fact that thread-specific control flow is performed using “masking”, leading to poor utilization where a processor’s threads follow different control-flow paths. For instance, to handle an IF-ELSE block where different threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.
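Masking can be sketched in ordinary sequential code. The following is an illustrative simulation only, not how any particular GPU implements it: a list stands in for the lock-step threads of one processor, and the hypothetical function `simt_if_else` runs each branch as a separate pass in which only the unmasked threads commit a result.

```python
def simt_if_else(values, cond, then_op, else_op):
    """Simulate an if/else over one group of lock-step threads.

    Returns the per-thread results and the number of passes the whole
    group had to step through (1 if control flow is coherent, 2 if not).
    """
    mask = [cond(v) for v in values]   # True: thread takes the 'then' path
    out = list(values)
    passes = 0

    # Pass over the 'then' path: threads with a False mask are disabled.
    if any(mask):
        passes += 1
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]

    # Pass over the 'else' path: threads with a True mask are disabled.
    if not all(mask):
        passes += 1
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]

    return out, passes


if __name__ == "__main__":
    is_even = lambda v: v % 2 == 0
    # Divergent: even and odd threads take different paths, so two passes.
    print(simt_if_else(list(range(8)), is_even, lambda v: v * 2, lambda v: v + 100))
    # Coherent: every thread takes the 'then' path, so one pass.
    print(simt_if_else([0, 2, 4, 6], is_even, lambda v: v * 2, lambda v: v + 100))
```

When the threads diverge, the group spends two passes while each thread does useful work in only one of them, which is the utilization loss described above.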
|Nvidia CUDA|OpenCL|Hennessy & Patterson|
|---|---|---|
|Thread|Work-item|Sequence of SIMD Lane operations|
|Warp|Wavefront|Thread of SIMD Instructions|
|Block|Workgroup|Body of vectorized loop|
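To make the hierarchy in the table concrete, the sketch below maps a flat thread index to its block, warp and lane, using the CUDA column's terms. It assumes a warp width of 32, as on Nvidia hardware; other hardware uses other widths. The function `locate` and the block size used in the example are hypothetical.

```python
WARP_SIZE = 32  # assumption: Nvidia's warp width; wavefront widths differ on other hardware


def locate(global_thread_id, block_size):
    """Return (block, warp within block, lane within warp) for a flat thread index."""
    block, thread_in_block = divmod(global_thread_id, block_size)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane


if __name__ == "__main__":
    # With 128 threads per block, each block holds 4 warps of 32 lanes.
    for tid in (0, 100, 133):
        print(tid, locate(tid, block_size=128))
```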
- General-purpose computing on graphics processing units
- Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. p. 52.
- “Nvidia Fermi Compute Architecture Whitepaper” (PDF). http://www.nvidia.com/. NVIDIA Corporation. 2009. Retrieved 2014-07-17.
- “NVIDIA Tesla: Unified Graphics and Computing Architecture”. IEEE Micro. IEEE. 28: 6. 2008. doi:10.1109/MM.2008.31.
- Rul, Sean; Vandierendonck, Hans; D’Haene, Joris; De Bosschere, Koen (2010). An experimental study on performance portability of OpenCL kernels. Symp. Application Accelerators in High Performance Computing (SAAHPC).
- “Advanced Topics in CUDA” (PDF). cc.gatech.edu. 2011. Retrieved 2014-08-28.
- Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. pp. 209 ff.