DOPIPE

DOPIPE parallelism is a method of loop-level parallelism that pipelines the statements in a loop. Pipelined parallelism may exist at different levels of abstraction, such as loops, functions and algorithmic stages. The extent of parallelism depends on the programmer's ability to make the most of this concept. It also depends on factors like identifying and separating independent tasks and executing them in parallel. [1]

Background

The main purpose of employing loop-level parallelism is to find the sequential tasks of a program and convert them into parallel tasks, without any prior information about the algorithm. Parts of data that are recurrent and consume a significant amount of execution time are good candidates for loop-level parallelism. Some common applications of loop-level parallelism are found in mathematical analysis that uses multi-dimensional matrices iterated over in nested loops, as in the sketch below. [2]
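For instance, the following C fragment (an illustrative example, not taken from the cited sources; the function name and matrix size are arbitrary) shows the kind of nested loop over a two-dimensional matrix that is a natural candidate for loop-level parallelism:

#define N 256

/* Each element update touches only m[i][j], with no dependence on other
 * elements, so every iteration of the nest could run in parallel. */
void scale_matrix(double m[N][N], double factor) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] *= factor;
}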

There are different kinds of parallelization techniques, chosen on the basis of data storage overhead, degree of parallelization and data dependencies. Some well-known techniques are DOALL, DOACROSS and DOPIPE.

DOALL: This technique is used where every iteration of the loop can be parallelized. The overall run time is thus reduced from N * T (for a serial processor, where T is the execution time of each iteration) to just T (since all N iterations execute in parallel).
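A minimal sketch of a DOALL loop in C, assuming OpenMP is available (the loop body mirrors statement F0 from the example further below; the array size and input data are arbitrary):

#include <stdio.h>

#define N 1000

int main(void) {
    static double o[N], x[N], a[N];
    for (int j = 0; j < N; j++) { x[j] = j; a[j] = 1.0; }

    /* Each iteration writes o[j] from x[j] and a[j] only, with no
     * loop-carried dependence, so all N iterations may run at once. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        o[j] = x[j] - a[j];

    printf("o[N-1] = %g\n", o[N - 1]);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), the iterations are distributed across threads; without the flag, the pragma is ignored and the loop runs serially.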

DOACROSS: This technique is used where there are data dependencies between iterations. The tasks are parallelized in such a way that all the data-independent work executes in parallel, while the dependent work executes sequentially, with a degree of synchronization between the parallel tasks.
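One possible sketch of a DOACROSS-style loop in C, using OpenMP's ordered construct to serialize only the dependent part of each iteration (the arrays, the sqrt workload and the schedule are illustrative assumptions, not prescribed by the source):

#include <stdio.h>
#include <math.h>

#define N 16

int main(void) {
    static double s[N + 1], x[N + 1];
    for (int j = 0; j <= N; j++) x[j] = j;

    #pragma omp parallel for ordered schedule(static, 1)
    for (int j = 1; j <= N; j++) {
        double t = sqrt(x[j]);   /* independent work: runs in parallel */
        #pragma omp ordered
        s[j] = s[j - 1] + t;     /* dependent work: executes in iteration order */
    }

    printf("s[N] = %g\n", s[N]);
    return 0;
}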

Description

DOPIPE is a pipelined parallelization technique used in loops where a value produced in one iteration is consumed in a later one. The following example shows how to apply the DOPIPE technique to reduce the total execution time by breaking the tasks inside the loop into stages and executing them in a pipelined manner. The split into tasks is done in such a way that all the dependencies within the loop are unidirectional, i.e., later iterations depend only on earlier ones, never the reverse.

Example

The program below shows pseudocode [2] for DOPIPE parallelization.

In this code, we see that there are three tasks (F0, F1 and F2) inside a loop iterating over j from 1 to N. The code has the following dependencies:

F1[j] →T F1[j+1]: statement F1 in iteration j+1 must execute after statement F1 in iteration j. This is also known as a true dependency.

F1[j] →T F2[j]: statement F2 in iteration j must execute after statement F1 in iteration j.

for (j = 1; j <= N; j++) {
    F0: o[j] = x[j] - a[j];
    F1: z[j] = z[j-1] * 5;
    F2: y[j] = z[j] * w[j];
}

If this code were executed sequentially, the total time consumed would be N * (T_F0 + T_F1 + T_F2), where T_F0, T_F1 and T_F2 denote the per-iteration execution times of F0, F1 and F2 respectively. Now suppose we parallelize the loop by pipelining the statements inside it in the following manner:

for (j = 1; j <= N; j++) {
    F0: o[j] = x[j] - a[j];   // DOALL parallelism
}
for (j = 1; j <= N; j++) {
    F1: z[j] = z[j-1] * 5;    // DOPIPE parallelism
    post(j);                  // the result of F1 is posted and available for use
}
for (j = 1; j <= N; j++) {
    wait(j);                  // wait until F1 has produced the value z[j] needed by F2
    F2: y[j] = z[j] * w[j];
}

Since F0 is an independent function with no loop-carried dependencies (it depends on neither the j+1 nor the j-1 iteration) and no dependence on the other statements in the loop, we can completely separate it and run it in parallel using DOALL parallelism. The statements F1 and F2, on the other hand, are dependent (as explained above), so we split them into two different loops and execute them in a pipelined fashion, using post(j) and wait(j) to synchronize between the F1 and F2 loops, as in the sketch below.
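The pseudocode leaves post(j) and wait(j) abstract. Below is a minimal sketch of one way they might be realized in C with POSIX threads and semaphores (on Linux); the names run_f1, run_f2 and posted, the size N and the input data are illustrative assumptions, not part of the original example:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8

double z[N + 1], y[N + 1], w[N + 1];
sem_t posted[N + 1];           /* posted[j] signals that z[j] is ready */

/* Producer stage: F1, carries the z[j-1] -> z[j] dependence serially */
void *run_f1(void *arg) {
    for (int j = 1; j <= N; j++) {
        z[j] = z[j - 1] * 5;
        sem_post(&posted[j]);  /* post(j): z[j] is now available */
    }
    return NULL;
}

/* Consumer stage: F2, blocks until F1 has produced z[j] */
void *run_f2(void *arg) {
    for (int j = 1; j <= N; j++) {
        sem_wait(&posted[j]);  /* wait(j) */
        y[j] = z[j] * w[j];
    }
    return NULL;
}

int main(void) {
    z[0] = 1.0;
    for (int j = 1; j <= N; j++) {
        w[j] = j;              /* arbitrary input data */
        sem_init(&posted[j], 0, 0);
    }

    pthread_t t1, t2;
    pthread_create(&t1, NULL, run_f1, NULL);
    pthread_create(&t2, NULL, run_f2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    for (int j = 1; j <= N; j++)
        printf("y[%d] = %g\n", j, y[j]);
    return 0;
}

Each semaphore starts at 0, so F2's sem_wait for iteration j blocks until F1's matching sem_post, which is exactly the unidirectional producer-consumer synchronization the pseudocode describes.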

Starting from the first iteration of j, statement F1 executes in T_F1 time. Meanwhile, F2 is not yet executing, because it is waiting for the value z[j] to be produced by F1. When F1 completes its execution for iteration j, it posts the value using post(j). After waiting on F1's execution via wait(j), F2 starts its own execution, since the value z[j] is now available. Also, since F1's execution for iteration j+1 is independent of F2's execution for iteration j, F1 for iteration j+1 and F2 for iteration j execute simultaneously. The figure below shows the execution timeline of all the statements.

[Figure: DOPIPE example — execution timeline of F0, F1 and F2]

From the figure, we see that the total time to execute F0 is T_F0, since all the iterations of F0 execute in parallel. For F1 and F2, the total execution time is N * T_F1 + T_F2 (assuming negligible synchronization time): every F2 except the last overlaps with the next F1, so only the final F2 adds to the serial F1 chain.

This is much less than the time required for sequential execution.
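As a concrete illustration with hypothetical figures: suppose N = 100 and T_F0 = T_F1 = T_F2 = 1 time unit. Sequential execution takes 100 * (1 + 1 + 1) = 300 units. After the transformation, the F0 loop finishes in about 1 unit (all iterations in parallel), while the F1/F2 pipeline finishes in 100 * 1 + 1 = 101 units, so the whole computation completes in roughly 101 units — close to a 3x speedup, bounded by the serial chain through F1.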

Comparison with other models

DOALL parallelism mainly works on the principle of divide and conquer: all the tasks run in different iterations on independent data. The problem with this implementation is that when a large amount of data is computed together, a large cache space is needed for the different threads to work with. Since there are no dependencies between the threads, there is no overhead for inter-thread communication.

In DOPIPE, by contrast, there is a synchronization overhead between the threads. But, due to its pipelined structure, it requires less cache space, because the data produced is immediately consumed by the consumer. [2]

See also

  • Parallel computing
  • Loop-level parallelism
  • Task parallelism
  • Data dependency
  • OpenMP
  • Automatic parallelization
  • Thread (computing)
  • Cache (computing)

References

  1. Pankratius, Victor; Adl-Tabatabai, Ali-Reza; Tichy, Walter (2011). Fundamentals of Multicore Software Development. CRC Press.
  2. Solihin, Yan (2009). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC.
