Stream processing

Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, [1] that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the floating-point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), [2] without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted, in order to minimize the loss in bandwidth associated with external memory interaction. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to initiate a direct memory access (DMA) when dependencies become known. The elimination of manual DMA management reduces software complexity, and the associated elimination of hardware-cached I/O reduces the data area that has to be serviced by specialized computational units such as arithmetic logic units.
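As a minimal sketch of uniform streaming in plain C++ (rather than a dedicated stream language; the kernel name is illustrative), one kernel function is applied to every element of an input stream to produce an output stream:

#include <algorithm>
#include <vector>

// A kernel function: pure and per-element, with no access to other elements.
float scale_and_bias(float x) { return 2.0f * x + 1.0f; }

int main() {
    std::vector<float> input(1024, 3.0f);    // the input stream
    std::vector<float> output(input.size()); // the output stream
    // Uniform streaming: the same kernel is applied to all elements.
    // Because the kernel exposes no dependencies between elements, an
    // implementation is free to pipeline, vectorize or parallelize this.
    std::transform(input.begin(), input.end(), output.begin(), scale_and_bias);
}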

During the 1980s stream processing was explored within dataflow programming . An example is the language SISAL (Streams and Iteration in a Single Assignment Language).


Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP- or GPU-type applications (such as image, video and digital signal processing) but less so for general-purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor design may be tuned for maximum efficiency or a trade-off for flexibility.

Stream processing is especially suitable for applications that exhibit three characteristics:

  • Compute intensity, the number of arithmetic operations per I/O or global memory reference. In many applications today it is well over 50:1 and increasing with algorithmic complexity.
  • Data parallelism, which exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
  • Data locality, a specific type of temporal locality common in signal and media processing applications, where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels, as well as intermediate data within kernel functions, can capture this locality directly using the stream processing programming model (see the sketch below).
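To make the locality point concrete, consider a hypothetical two-kernel pipeline in plain C++: when the producing and consuming kernels are fused, the intermediate value lives in a register for exactly one use and is never written back to memory.

#include <cstddef>
#include <vector>

// Two chained kernels: kernel_a produces an intermediate stream
// that kernel_b consumes.
float kernel_a(float x) { return x * x; }
float kernel_b(float y) { return y + 1.0f; }

// Fused form: the intermediate value never touches external memory,
// capturing the producer-consumer locality described above.
void run_fused(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = kernel_b(kernel_a(in[i]));
}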

Examples of records within streams include:

  • In graphics, each record might be the vertex, normal, and color information for a triangle;
  • In image processing, each record might be a single pixel from an image;
  • In a video encoder, each record may be 256 pixels forming a macroblock of data; or
  • In wireless signal processing, each record could be obtained from an antenna.

For each record we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.
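In a C-style signature this restriction can be expressed by giving a kernel read-only inputs and write-only outputs (a sketch, not any particular API; __restrict is a common compiler extension asserting that the buffers do not alias):

// The input streams are read-only and the output stream is write-only;
// no buffer is both readable and writable inside the kernel.
void kernel(const float* __restrict in0,
            const float* __restrict in1, // multiple inputs are allowed
            float* __restrict out,       // as are multiple outputs
            int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in0[i] + in1[i];
}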

Comparison to prior parallel paradigms

Basic computers started from a sequential execution paradigm. Traditional CPUs are SISD based, which means they conceptually perform only one operation at a time. As the computing needs of the world evolved, the amount of data to be managed grew very quickly. It was obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts have been made to find alternative ways to perform massive amounts of computations. The result of those efforts was SIMD, a programming paradigm which allowed applying one instruction to multiple instances of (different) data. Most of the time, SIMD was used in a SWAR environment. By using more complicated structures, one could also have MIMD parallelism.

Although these two paradigms were efficient, real-world implementations were plagued with limitations, from memory alignment problems to synchronization issues and limited parallelism. Only a few SIMD processors survived as stand-alone components; most were embedded in standard CPUs.

Consider a simple program adding up two arrays containing 100 4-component vectors (i.e. 400 numbers in total).

Conventional, sequential paradigm

for (int i = 0; i < 100 * 4; i++)
    result[i] = source0[i] + source1[i];

This is the sequential paradigm that is most familiar. Variations do exist (such as inner loops, structures and such), but they ultimately boil down to that construct.

Parallel SIMD paradigm, packed registers (SWAR)

for (int el = 0; el < 100; el++) // for each vector
    vector_sum(result[el], source0[el], source1[el]);

This is actually oversimplified. It assumes the instruction vector_sum works. Although this is what happens with instruction intrinsics, much information is actually not taken into account here, such as the number of vector components and their data format. This is done for clarity.
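For the curious, here is a concrete x86 version of the loop above using SSE intrinsics (a sketch assuming a compiler that provides <xmmintrin.h> and arrays holding at least 400 floats):

#include <xmmintrin.h> // SSE intrinsics

void vector_sum_sse(const float* source0, const float* source1, float* result) {
    for (int el = 0; el < 100; ++el) {             // for each 4-component vector
        __m128 a = _mm_loadu_ps(source0 + el * 4); // load 4 floats
        __m128 b = _mm_loadu_ps(source1 + el * 4);
        _mm_storeu_ps(result + el * 4, _mm_add_ps(a, b)); // one 4-wide add
    }
}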

As you can see, this method reduces the number of decoded instructions from numElements * componentsPerElement to numElements. The number of jump instructions is also decreased, as the loop is run fewer times. These gains result from the parallel execution of the four mathematical operations.

What happened however is that each SIMD register holds a certain amount of data, so it is not possible to get more parallelism. The speed-up is somewhat limited by the assumption we made of performing four parallel operations (please note this is common for both AltiVec and SSE).

Parallel Stream paradigm (SIMD / MIMD)

// This is a fictional language for demonstration purposes.
elements = array streamElement([number, number])[100]
kernel = instance streamKernel("@arg0[@iter]")
result = kernel.invoke(elements)

In this paradigm, the whole dataset is defined, rather than each component being defined separately. The definition of the data is assumed to be in the first two rows. After that, the result is inferred from the sources and the kernel. For simplicity, there is a 1:1 mapping between input and output data, but this does not need to be the case. Applied kernels can also be much more complex.
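The same structure can be mirrored in working C++ (a sketch; invoke and StreamElement are illustrative names, not a real library): the caller hands the kernel the whole dataset, and how the elements are distributed over computational units is the implementation's business.

#include <array>
#include <cstddef>

struct StreamElement { float a, b; };

// invoke() applies the kernel to the entire stream at once; the loop body
// is free to be unrolled, vectorized or spread over many ALUs.
template <typename Kernel, std::size_t N>
std::array<float, N> invoke(Kernel kernel, const std::array<StreamElement, N>& elements) {
    std::array<float, N> result{};
    for (std::size_t i = 0; i < N; ++i)
        result[i] = kernel(elements[i]);
    return result;
}

int main() {
    std::array<StreamElement, 100> elements{};
    auto result = invoke([](StreamElement e) { return e.a + e.b; }, elements);
    (void)result;
}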

An implementation of this paradigm can “unroll” a loop internally. This allows throughput to scale with chip complexity, easily utilizing hundreds of ALUs. [1] The elimination of complex data patterns makes much of this extra power available.

While stream processing is a branch of SIMD/MIMD processing, they must not be confused. Although SIMD implementations can often work in a “streaming” manner, their performance is not comparable: the model envisions a very different usage pattern which allows far greater performance by itself. It has been noted [2] that when applied to such standard CPUs, only a 1.5x speedup can be achieved. By contrast, ad-hoc stream processors easily reach over 10x performance, mainly attributed to the more efficient memory access and higher levels of parallel processing.

Although there are various degrees of flexibility allowed by the model, stream processors usually impose some limitations on the kernel or stream size. For example, consumer hardware often lacks the ability to perform high-precision math, lacks complex indirection chains or presents limits on the number of instructions that can be executed.


Research

Stanford University stream processing projects included the Stanford Real-Time Programmable Shading Project started in 1999. [3] A prototype called Imagine was developed in 2002. [4] A project called Merrimac ran until about 2004. [5] AT&T also researched stream-enhanced processors as graphics processing units rapidly evolved in both speed and functionality. [3] Since these early days, dozens of stream processing languages have been developed, as well as specialized hardware.

Programming model notes

The most immediate challenge in the realm of parallel processing does not lie so much in the type of hardware architecture used, but in how easy it will be to program the system in question in a real-world environment with acceptable performance. Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and DMA scheduling. This in itself is a result of research at MIT and Stanford in finding an optimal layering of tasks between programmer, tools and hardware. Programmers beat tools in mapping algorithms to parallel hardware, and tools beat programmers in figuring out the smartest memory allocation schemes, etc. Of particular concern are MIMD designs such as Cell, for which the programmer needs to deal with application partitioning across multiple cores, process synchronization and load balancing. Efficient multi-core programming tools are severely lacking today.

A drawback of SIMD programming was the issue of Array-of-Structures (AoS) and Structure-of-Arrays (SoA) . Programmers often wanted to build data structures with a ‘real’ meaning, for example:

// A particle in a three-dimensional space.
struct particle_t {
    float x, y, z;          // not even an array!
    unsigned char color[3]; // 8 bit per channel, RGB only
    float size;
    // ... and many other attributes may follow ...
};

What happened is that these structures were then assembled in arrays to keep things nicely organized. This is an array of structures (AoS). When the structure is laid out in memory, the compiler will produce interleaved data, in the sense that all the structures will be contiguous but there will be a constant offset between, say, the “size” attribute of a structure instance and the same element of the following instance. The offset depends on the structure definition (and possibly other things not considered here, such as the compiler's policies). There are other problems as well. For example, the three position variables cannot be SIMD-ized that way, because it is not certain they will be allocated in continuous memory space. To make sure SIMD operations can work on them, they should be grouped in a packed memory location, or at least in an array. Another problem lies in both “color” and “xyz” being defined as three-component vector quantities. SIMD processors usually have support for 4-component operations only (with some exceptions, however).
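The constant offset can be made concrete with offsetof and sizeof (standard C++, matching the particle_t definition above):

#include <cstddef>
#include <cstdio>

struct particle_t {
    float x, y, z;
    unsigned char color[3];
    float size;
};

int main() {
    // In an array of structures, the "size" fields of consecutive particles
    // are sizeof(particle_t) bytes apart (interleaved rather than
    // contiguous), so they cannot be fetched with one packed SIMD load.
    std::printf("offset of size: %zu, stride between sizes: %zu\n",
                offsetof(particle_t, size), sizeof(particle_t));
}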

These kinds of problems and limitations made SIMD acceleration on standard CPUs quite nasty. The proposed solution, structure of arrays (SoA), follows:

struct particle_t {
    float *x, *y, *z;
    unsigned char *colorRed, *colorBlue, *colorGreen;
    float *size;
};

For readers not experienced with C, the ‘*’ before each identifier means a pointer. In this case, they will be used to point to the first element of an array, which is to be allocated later. For Java programmers, this is roughly equivalent to “[]”. The drawback here is that the various attributes could be spread in memory. To make sure this does not cause cache misses, we will have to update all the various “reds”, then all the “greens” and “blues”.
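In exchange, each attribute becomes one contiguous array, which is exactly what packed SIMD loads want. A sketch (assuming the arrays above were allocated elsewhere):

#include <cstddef>

// All the x components are contiguous, so this loop can be vectorized with
// packed SIMD loads and stores; the same applies to y, z and the colors.
void move_right(float* x, std::size_t count, float dx) {
    for (std::size_t i = 0; i < count; ++i)
        x[i] += dx;
}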

For stream processors, the usage of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a stride between ‘consecutive’ elements of the same attribute, effectively enabling interleaved data. When the GPU begins the stream processing, it will gather all the various attributes in a single set of parameters (usually this looks like a structure or a “magic global variable”), performs the operations and scatters the results to some memory area for later processing (or retrieving).
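The attribute-plus-stride idea can be sketched generically in C++ (AttributeView and gather3 are illustrative names, not a real GPU API): the application describes where an attribute lives inside an interleaved block, and the runtime gathers it from there.

#include <cstddef>
#include <cstring>

// Describes one attribute inside an interleaved memory block.
struct AttributeView {
    const unsigned char* base; // address of the attribute in the first element
    std::size_t stride;        // bytes between consecutive elements
};

// Gather element i of a 3-float attribute out of interleaved storage.
void gather3(const AttributeView& attr, std::size_t i, float out[3]) {
    std::memcpy(out, attr.base + i * attr.stride, 3 * sizeof(float));
}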

More modern stream processing frameworks provide a FIFO-like interface to structure data as a literal stream. This abstraction provides a means to specify data dependencies implicitly while enabling the runtime/hardware to take full advantage of that knowledge for efficient computation. One of the simplest and most efficient stream processing modalities for C++ is RaftLib, which enables linking independent compute kernels together as a data flow graph using C++ stream operators. As an example:

#include <raft>
#include <raftio>
#include <cstdlib>
#include <string>

class hi : public raft::kernel
{
public:
    hi() : raft::kernel()
    {
        output.addPort< std::string >( "0" );
    }

    virtual raft::kstatus run()
    {
        output[ "0" ].push( std::string( "Hello World\n" ) );
        return( raft::stop );
    }
};

int
main( int argc, char **argv )
{
    /** instantiate print kernel **/
    raft::print< std::string > p;
    /** instantiate hello world kernel **/
    hi hello;
    /** make a map object **/
    raft::map m;
    /** add kernels to map, both hello and p are executed concurrently **/
    m += hello >> p;
    /** execute the map **/
    m.exe();
    return( EXIT_SUCCESS );
}

Models of computation for stream processing

Apart from specifying streaming applications in high-level languages, models of computation (MoCs) have also been widely used, such as dataflow models and process-based models.

Generic processor architecture

Historically, CPUs began implementing various tiers of memory access optimizations because of their ever-increasing performance when compared to relatively slow-growing external memory bandwidth. As this gap widened, large amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes to the few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimate, consider it to be less than 10%).

A similar architecture exists on stream processors, but thanks to the new programming model, the number of transistors dedicated to management is actually very small.

From a whole system point of view, stream processors usually exist in a controlled environment. GPUs do exist on an add-in board (this seems to also apply to Imagine). CPUs continue to do the dirty job of managing resources, running applications, and such.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common, multi-buses have been employed in the past). The exact amount of memory lanes is dependent on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (actually up to 512 MB) with a slightly slower crossbar that is 256 bits wide. By contrast, standard processors from Intel Pentium to some Athlon 64 have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The thing which most closely matches a multiple pointer indirection is an indirection chain, which is however guaranteed to finally read or write from a specific memory area (inside a stream).

Because of the SIMD nature of the stream processor's ALU clusters, read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example). This also allows for efficient memory bus negotiations.

Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays off.

Internally, a stream processor features some clever communication and management circuits, but what is interesting is the Stream Register File (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulks. As a cache-like software-controlled structure to the various ALUs, the SRF is shared between all the various ALU clusters. The key concept and innovation done with Stanford's Imagine chip is that the compiler is able to automate and allocate memory in an optimal way, fully transparent to the programmer. The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform flow analysis and optimally pack the SRFs. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler did an as good or better job at scheduling memory than if you hand-tuned the thing with much effort.

There is proof that there can be a lot of clusters, because inter-cluster communication is assumed to be rare. Internally however, each cluster can efficiently exploit a much lower number of ALUs, because intra-cluster communication is common and thus needs to be highly efficient.

To keep those ALUs fetched with data, each ALU is equipped with local register files (LRFs), which are basically its usable registers.

This three-tiered data access pattern makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving.

Hardware-in-the-loop issues


Although an order-of-magnitude speedup can reasonably be expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this. Communication latencies are actually the biggest problem. Although PCI Express improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work may still take a long time, which means it is usually counter-productive to use them for small datasets. Because changing the kernel is a rather expensive operation, the stream architecture also incurs penalties for small streams, a behavior referred to as the short stream effect.

Pipelining is a very widespread and heavily used practice on stream processors, with GPUs featuring pipelines exceeding 200 stages. The cost for switching settings is dependent on the setting being modified, but is now considered to always be expensive. To avoid those problems at various levels of the pipeline, many techniques have been deployed, such as “über shaders” and “texture atlases”. Those techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.


Notable stream processors

  • The Blitter in the Commodore Amiga is an early (circa 1985) graphics processor capable of combining three source streams of 16-component bit vectors to produce an output stream. Total input stream bandwidth is up to 42 million bits per second. Output stream bandwidth is up to 28 million bits per second.
  • Imagine, [6] headed by Professor William Dally of Stanford University , is a flexible architecture designed to be both fast and energy efficient. The project, originally conceived in 1996, included architecture, software tools, a VLSI implementation and a development board, and was funded by DARPA , Intel and Texas Instruments .
  • Another Stanford project, called Merrimac, [7] is aimed at developing a stream-based supercomputer. Merrimac intends to use a stream architecture and advanced interconnection networks to provide more performance per unit cost than cluster-based scientific computers built from the same technology.
  • The Storm-1 family from Stream Processors, Inc. , a commercial spin-off of Stanford's Imagine project, was announced during a feature presentation at ISSCC 2007. The family contains four members ranging from 30 GOPS to 220 16-bit GOPS (billions of operations per second), all fabricated at TSMC in a 130 nanometer process. The devices target the high end of the DSP market, including video conferencing , multifunction printers and digital video surveillance equipment.
  • GPUs are widespread, consumer-grade stream processors [4] designed mainly by AMD and Nvidia . Various generations to be noted from a stream processing point of view:
    • Pre-R2xx / NV2x: no explicit support for stream processing. Kernel operations were hidden in the API and provided too little flexibility for general use.
    • R2xx/NV2x: kernel stream operations became explicitly under the programmer's control, but only for vertex processing (fragments were still using old paradigms). No branching support severely hampered flexibility, but some types of algorithms could be run (notably, low-precision fluid simulation).
    • R3xx/NV4x: flexible branching support, although some limitations still exist on the number of operations to be executed and strict recursion depth, as well as array manipulation.
    • R8xx: Supports append / consume buffers and atomic operations. This generation is the state of the art.
  • AMD FireStream , AMD/ATI's brand name for its HPC-targeting product line
  • Nvidia Tesla , Nvidia's brand name for its HPC-targeting product line
  • The Cell processor from STI , an alliance of Sony Computer Entertainment , Toshiba Corporation , and IBM , is a hardware architecture that can function as a stream processor with appropriate software support. It consists of a controlling processor, the PPE (Power Processing Element, an IBM PowerPC ) and a set of SIMD coprocessors, called Synergistic Processing Elements (SPEs), each with independent program counters and instruction memory, in effect a MIMD machine. Because the local memory for instructions and data is limited, programs that exploit this architecture either need a tiny memory footprint or must adhere to a stream programming model. In the native programming model all DMA and program scheduling is left up to the programmer. The hardware provides a fast ring bus among the processors for local communication. With a suitable algorithm the performance of the Cell can rival that of pure stream processors; however, this nearly always requires a complete redesign of algorithms and software.

Stream programming libraries and languages

Most programming languages for stream processors start with Java, C or C++ and add extensions which provide specific instructions to allow application developers to tag kernels and/or streams. This also applies to most shading languages , which can be considered stream programming languages to a certain degree.

Non-commercial examples of stream programming languages ​​include:

  • Ateji PX Free Edition , enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm on the JVM
  • Auto-Pipe , from the Stream Based Supercomputing Lab at Washington University in St. Louis , an application development environment for streaming applications that allows authoring of applications for heterogeneous systems (CPU, GPGPU , FPGA). Applications can be developed in any combination of C, C++, and Java for the CPU, and Verilog or VHDL for FPGAs. Cuda is currently used for Nvidia GPGPUs. Auto-Pipe also handles coordination of TCP connections between multiple machines.
  • ACOTES Programming Model: Language from Polytechnic University of Catalonia based on OpenMP
  • Brook language from Stanford
  • CAL Actor Language: a high-level programming language for dataflow actors, which are stateful operators that transform input streams of data objects (tokens) into output streams.
  • Cal2Many , a code generation framework from Halmstad University, Sweden. It takes CAL code as input and generates different target-specific languages, including sequential C, Chisel, parallel C targeting the Epiphany architecture, ajava & astruct targeting the Ambric architecture, etc.
  • DUP language from Technical University of Munich and University of Denver
  • RaftLib – open source C ++ stream processing template library originally from the Stream Based Supercomputing Lab at Washington University in St. Louis
  • Sh library from the University of Waterloo
  • Shallows , an open source project
  • S-Net coordination language from the University of Hertfordshire , which provides separation of coordination and algorithmic programming
  • StreamIt from MIT
  • Siddhi from WSO2
  • WaveScript Functional Stream Processing, also from MIT.
  • Functional reactive programming could be considered stream processing in a broad sense.

Commercial implementations are either general purpose or tied to specific hardware by a vendor. Examples of general purpose languages include:

  • AccelerEyes ' Jacket, a commercialization of a GPU engine for MATLAB
  • Ateji PX , a Java extension that enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm
  • Floodgate , a stream processor provided with the Gamebryo game engine for PlayStation 3, Xbox360, Wii, and PC
  • OpenHMPP , a “directive” vision of Many-Core programming
  • PeakStream, [8] a spinout of the Brook project (acquired by Google in June 2007)
  • IBM Spade – Stream Processing Application Declarative Engine (B. Gedik, et al. SPADE: the System S declarative stream processing engine. ACM SIGMOD 2008.)
  • RapidMind , a commercialization of Sh (acquired by Intel in August 2009)
  • TStreams, [9] [10] Hewlett-Packard Cambridge Research Lab

Vendor-specific languages ​​include:

  • Brook+ (AMD hardware optimized implementation of Brook ) from AMD/ATI
  • Compute Unified Device Architecture ( CUDA ) from Nvidia
  • Intel Ct – C for Throughput Computing
  • StreamC from Stream Processors, Inc. , a commercialization of the Imagine work at Stanford

Event-Based Processing

  • Wallaroo
  • WSO2 Stream Processor

Batch File Based Processing (emulates some actual stream processing, but with much lower performance in general)

  • Apache Kafka
  • Apache Flink
  • Apache Storm
  • Apache Apex
  • Apache Spark

Stream Processing Services:

  • Amazon Web Services – Kinesis
  • Google Cloud – Dataflow
  • Microsoft Azure – Stream Analytics

See also

  • LMIS
  • Parallel computing
  • Molecular modeling on GPU
  • Vector processor
  • Partitioned global address space
  • Streaming algorithm
  • Data stream mining
  • Dimension reduction
  • Flow-based programming
  • Real Time Streaming Protocol


References

  2. FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs
  3. Eric Chan. “Stanford Real-Time Programmable Shading Project”. Research group web site. Retrieved March 9, 2017.
  4. “The Imagine – Image and Signal Processor”. Group web site. Retrieved March 9, 2017.
  5. “Merrimac – Stanford Streaming Supercomputer Project”. Group web site. Archived from the original on December 18, 2013. Retrieved March 9, 2017.
  6. Imagine
  7. Merrimac
  8. PeakStream unveils multicore and CPU/GPU programming solution
  9. TStreams: A Model of Parallel Computation (Technical report).
  10. TStreams: How to Write a Parallel Program (Technical report).
