8.2 Basic Ideas

Two important parts of computer hardware are the processor, which performs computations, and memory, in which programs and data are stored. A processor is also often called a central processing unit (CPU). Modern computer systems adopt a stored-program architecture: the program instructions are stored in memory together with the data they operate on, and a processor executes them sequentially.

In a traditional single processor computer, a single stream of instructions is generated from the program, and these instructions operate on a single stream of data. [10] called this arrangement a single instruction stream-single data stream (SISD) computer.

On the other hand, a parallel computer system uses several processors, and is realized as a single instruction stream-multiple data stream (SIMD) computer or a multiple instruction stream-multiple data stream (MIMD) computer. SIMD refers to a form of parallel execution in which all processors execute the same operation on different data at the same time, and is often associated with performing the same operation on every element of a vector or array. MIMD refers to parallel execution in which each processor works independently; for example, one processor might update a database file while another processor handles a graphic display.

The fundamental software of a modern computer system is an operating system such as UNIX or Microsoft Windows. Operating systems support multiple users and multiple tasks, even on single processor systems, by a time-slicing mechanism in which the processor executes tasks cyclically. In a parallel computer system, some tasks are executed simultaneously on different processors.

8.2.1 Memory Architectures of Parallel Computers

The traditional computer system has a single processor (or CPU) that can access all of the memory (Fig. 8.1). Parallel computers use more than one processor simultaneously for a single calculation task. There are two simple methods to increase the number of available processors in a single system. One method is to add processors to the traditional single processor system without changing other parts. Because all the memory is shared by all processors, such systems are called shared memory systems (Fig. 8.2). An example of a shared memory system is a dual processor personal computer, where the motherboard has two sockets for CPUs. When we mount one CPU, it works as a traditional single processor system. If we mount two CPUs, both processors can access all the memory in the PC, and it works as a shared memory system. A second method is to connect traditional single processor computers by a network. This is called a distributed memory system, because the memory is used by a single processor locally and is ``distributed'' over the whole system (Fig. 8.3). An example of a distributed memory system is a network of workstations, in which each node computer works independently and communicates with the others through a network to solve a single problem.

Figure 8.1: Traditional system
\includegraphics[width=12mm]{text/2-8/Figure1.eps}
Figure 8.2: Shared memory system
\includegraphics[width=40mm]{text/2-8/Figure2.eps}
Figure 8.3: Distributed memory system
\includegraphics[width=44mm]{text/2-8/Figure3.eps}

Integration of shared memory and distributed memory is possible (Fig. 8.4). Network-connected PCs that each have two processors can be considered a distributed shared memory system.

Figure 8.4: Distributed shared memory system
\includegraphics[width=\textwidth]{text/2-8/Figure4.eps}

8.2.1.1 Shared Memory Systems

In the simple shared memory realization, all the processors can access all the memory at the same speed by using a common memory bus. This is known as a uniform memory access (UMA) configuration. Performance in a UMA system is limited by the memory bus bandwidth; adding processors to the system beyond some point does not increase performance linearly, because signals from processors flow on the same memory bus and often cause collisions. Typically, UMA configurations do not scale well beyond $ 10$ to $ 20\,$processors.

To improve communication between processors and memory, a non-uniform memory access (NUMA) configuration is used. In NUMA systems, all processors have access to all the memory, but the cost of accessing a specific location in memory is different for different processors, because different regions of memory are on physically different buses. Even if we adopt a NUMA configuration, it is not efficient to use more than $ 100\,$processors in a shared memory system.

A shared memory system is also a symmetric multiprocessor (SMP) system, in which any processor can perform any piece of work equally well.

In a shared memory system, a single copy of an operating system is in charge of all the processors and the memory. Programming usually follows the ``fork-join'' model. Each program begins by executing just a single task, called the master. When the first parallel work is reached, the master spawns (or forks) additional tasks (called slaves or workers), which ``join'' the master when they finish their work (the middle figure in Fig. 8.5). Such activities can be programmed by using software technologies such as processes, threads or OpenMP, which are explained in the next section.

Figure 8.5: Typical parallel computing execution
\includegraphics[width=117mm]{text/2-8/Figure5.eps}
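For illustration, the fork-join model can be expressed with OpenMP in C. The following is a minimal sketch, assuming a compiler with OpenMP support (e.g. gcc -fopenmp); the loop and the summed quantity are purely illustrative and not taken from the text.

#include <stdio.h>
#include <omp.h>          /* OpenMP runtime; compile with e.g. gcc -fopenmp */

int main(void)
{
    double sum = 0.0;
    int i;

    /* The master task forks worker threads at this directive and
       joins them again at the end of the parallel region.         */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 1000000; i++)
        sum += 1.0 / (double) i;      /* each thread works on part of the loop */

    printf("partial harmonic sum = %f\n", sum);
    printf("threads available    = %d\n", omp_get_max_threads());
    return 0;
}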

8.2.1.2 Distributed Memory Systems

In a distributed memory system, each node computer is an independent computer that has, at least, a processor and memory, and the nodes are connected together by a network. This so-called ``network of workstations'' (NOW) is the cheapest way to construct a distributed memory system, because we can utilize many different kinds of existing workstations, already connected by a network, without adding any new hardware. However, a NOW is sometimes ineffective for heavy computation, because, for example, general purpose networks are slow, and nodes may be unexpectedly used for other work, so that it is difficult to schedule them efficiently.

Nowadays, ``Beowulf class cluster computers'' are popular for distributed memory parallel computing ([34]). These are a kind of NOW, but with some differences. The nodes in the cluster are the same kind of workstation or PC and are dedicated to the cluster's calculation tasks. Typically, node computers share a working directory on the hard disk and have no display or keyboard. The interconnection network is isolated from external networks and is also dedicated to the cluster, and communication among the nodes can be done without further authentication. Operating system parameters are tuned to improve the total performance for parallel computing. All these characteristics improve the performance of parallel computing on the cluster.

Distributed memory systems have no memory bus problem. Each processor can use the full bandwidth to its own local memory without interference from other processors. Thus, there is no inherent limit to the number of processors. The size of the system is constrained only by the network used to connect the node computers. Some distributed memory systems consist of several thousand processors.

As nodes in a distributed memory system share no memory at all, exchange of information among processors is more difficult than in a shared memory system. We usually adopt a message passing programming model on a distributed memory system; we organize a program as a set of independent tasks that communicate with each other via messages. This introduces two sources of overhead: it takes time to construct and send a message from one processor to another, and the receiving processor must be interrupted to deal with messages from other processors.

Widely available message passing libraries are PVM and MPI. The right figure in Fig. 8.5 shows an execution image of MPI. HPF is also mainly used on distributed memory systems. These technologies are described in the next section.
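As an illustration of the message passing model, the following is a minimal MPI sketch in C, assuming an MPI implementation with mpicc and mpirun is installed; the message contents and tag are arbitrary.

#include <stdio.h>
#include <mpi.h>    /* compile with mpicc, run with e.g. mpirun -np 4 ./a.out */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of tasks  */

    if (rank != 0) {
        /* worker tasks send a message to task 0 */
        int data = rank * rank;
        MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        /* task 0 collects one message from every other task */
        int i, data;
        for (i = 1; i < size; i++) {
            MPI_Recv(&data, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d from task %d\n", data, i);
        }
    }
    MPI_Finalize();
    return 0;
}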

8.2.2 Costs for Parallel Computing

We expect the calculation speed to increase $ n$-fold if we use $ n$ processors instead of one. We also wish to use a multiprocessor system just like an ordinary single processor system. However, realizing parallel computing incurs several costs: the non-parallel characteristics of the problem, communication costs such as distributing and gathering data and/or programs, the programming effort needed to synchronize executions and unexpected influences of cache memory. All these factors reduce the benefit of parallelization.

8.2.2.1 Amdahl's Law

All programming tasks include non-parallelizable or serial parts, which cannot be executed on several processors, for example, summarizing calculation results and writing them to the display or a file. Assume the ratio of the computing time for the serial parts to that for the whole task is $ f$ ($ 0<f<1$). If a single processor requires $ t_s$ time to complete the task, $ (1-f)t_s$ of the computation time is used for the parallelizable parts and $ ft_s$ for the serial parts. If we use $ n$ processors, the elapsed time for the parallelizable parts will be at least $ (1-f)t_s/n$, while the execution time of the serial parts remains $ ft_s$. Thus, the ratio of the execution time on one processor to that on $ n$ processors, $ S(n)$, which is called the speedup factor, is

$\displaystyle \notag S(n) = \frac{t_s}{f t_s + (1-f) t_s / n} = \frac{n}{1 + (n-1)f}\,.$    

This equation is known as ``Amdahl's law'' ([1]). When $ n$ is large, $ S(n)$ converges to $ 1/f$; that is, the benefit of parallel computing is limited. For example, if $ f=5\,{\%}$, the maximum possible speedup is $ 20$, even if we use an infinite number of processors. This may discourage the use of parallel computing.

Of course, as $ f$ goes to zero, $ S(n)$ converges to $ n$, which is an ideal situation.
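The limit imposed by Amdahl's law is easy to check numerically. The following short C sketch evaluates $ S(n)$ for a few values of $ n$ with the serial fraction $ f=5\,{\%}$ used in the example above; the chosen processor counts are illustrative.

#include <stdio.h>

/* Amdahl's speedup: S(n) = n / (1 + (n - 1) * f) */
static double amdahl(int n, double f)
{
    return (double) n / (1.0 + (n - 1) * f);
}

int main(void)
{
    double f = 0.05;                      /* 5% serial fraction, as in the text */
    int    n[] = { 2, 10, 100, 1000 };

    for (int i = 0; i < 4; i++)
        printf("n = %4d  S(n) = %6.2f\n", n[i], amdahl(n[i], f));
    /* S(n) approaches 1/f = 20 as n grows */
    return 0;
}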

8.2.2.2 Gustafson's Law

Amdahl's law considers the situation where the task size is fixed and the number of processors increases. In real problems, however, we wish to perform larger tasks when more processors are available. For example, assume time $ s$ is required for preparing a task, and time $ p$ is required for a (moderate) simulation task. When a parallel computer is available, we wish to perform more simulations, typically a simulation $ n$ times larger than the original one on $ n$ processors. For this larger simulation, a single processor system requires $ s+np$ time, while the $ n$-processor system requires $ s+p$ time. The speedup factor is

$\displaystyle \notag S(n) = \frac{s+np}{s+p}\,.$    

This equation is called ``Gustafson's law'' ([14]). Note that if we define $ f=s/(s+np)$, it takes the same form as Amdahl's law. However, when $ n$ becomes large, $ S(n)$ grows linearly in $ n$. This means that parallel computing is useful for large-scale problems in which the serial part does not increase as the problem size increases. If $ s$ approaches zero, $ S(n)$ converges to $ n$, the ideal situation.
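Similarly, the following short C sketch evaluates Gustafson's speedup; the values of $ s$ and $ p$ are chosen only for illustration.

#include <stdio.h>

/* Gustafson's speedup: S(n) = (s + n*p) / (s + p) */
static double gustafson(int n, double s, double p)
{
    return (s + n * p) / (s + p);
}

int main(void)
{
    double s = 1.0, p = 9.0;              /* illustrative setup and simulation times */

    for (int n = 1; n <= 1000; n *= 10)
        printf("n = %4d  S(n) = %8.2f\n", n, gustafson(n, s, p));
    /* S(n) grows nearly linearly in n when s is small relative to n*p */
    return 0;
}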

8.2.2.3 Other Costs

If we divide one task into several small tasks and execute them in parallel, we must wait until all the child tasks have been completed: we must synchronize the executions. As the slowest child task determines the total execution time, child tasks should be designed to have nearly equal execution times; otherwise some processors may be idle while others have tasks queuing for execution. Techniques that aim to spread tasks equally among the processors are called load balancing, and they are not easy.

In a shared memory system, information is exchanged among processors through variables stored in the shared memory. If several tasks use one variable almost simultaneously, trouble can arise. Consider two tasks trying to decrease the value of a variable $ x$ by one. Assume $ x=3$; task $ 1$ obtains this value, decreases it and writes $ 2$ into $ x$. If task $ 2$ tries to do the same before task $ 1$ finishes its work, task $ 2$ also obtains the value $ 3$ and writes $ 2$ into $ x$. Thus, the final result is $ 2$, although $ x$ should have been decreased twice. To avoid such incorrect operation, task $ 2$ must wait until task $ 1$ finishes. All parallel computing software can handle this synchronization problem, typically by using a lock-unlock mechanism.
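The following is a minimal sketch of the lock-unlock mechanism using POSIX threads in C (link with -lpthread); the two tasks of the example above are realized as threads, and the starting value of $ x$ mirrors the text.

#include <stdio.h>
#include <pthread.h>   /* POSIX threads; link with -lpthread */

static int             x    = 3;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* each task decreases x by one; the mutex makes the read-modify-write atomic */
static void *decrease(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock);     /* task 2 waits here until task 1 unlocks */
    x = x - 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, decrease, NULL);
    pthread_create(&t2, NULL, decrease, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("x = %d\n", x);         /* always 1, never 2 */
    return 0;
}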

An important hardware aspect of shared memory systems is cache memory. As advances in main memory technology have not kept up with processor innovations, memory access is very slow compared with processor speed. To mitigate this problem, another layer of memory, called the cache, is placed between a processor and main memory. It is a small amount of very fast, expensive memory that operates close to the speed of the processor. A separate cache controller monitors memory accesses and loads data and instructions in blocks of contiguous locations from memory into the cache. Once the contents of memory are stored in the cache, the processor can operate on them at full speed. Sometimes the cache does not contain the data the processor needs; in that case the processor is stalled and has to wait while the necessary data are loaded from memory into the cache. This mechanism works well in a single processor system.

All processors in a shared memory system have their own caches. Suppose several processors access the same location of memory and copy it into their caches. If one processor changes the value at that location, the other processors should no longer use the values in their caches. A cache coherence protocol is used to propagate this information among the caches. A common cache coherence protocol is an invalidate policy: when one copy of a memory location is altered, the same data in all other caches are invalidated (by resetting a valid bit in the cache). In shared memory systems, cache coherence is handled in hardware and the programmer need not worry about it; however, it may slow down the calculation. Note that caches handle blocks of memory. If one processor writes to one part of a block, copies of the whole block in other caches are invalidated even though the actual data are not shared. This is known as false sharing and can damage the performance of the cache in a shared memory system. To achieve adequate performance, we sometimes have to write programs with the size of the cache memory in mind.
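One common way to reduce false sharing is to pad per-thread data so that each item occupies its own cache line. The following OpenMP sketch in C assumes a 64-byte cache line and four threads; both numbers are illustrative assumptions, not taken from the text.

#include <stdio.h>
#include <omp.h>

#define NTHREADS 4                 /* illustrative thread count */
#define LINE     64                /* assumed cache-line size in bytes */

/* each thread's counter occupies its own cache line, so an update by one
   thread does not invalidate the line that holds another thread's counter */
struct padded_counter {
    long value;
    char pad[LINE - sizeof(long)];
};

int main(void)
{
    struct padded_counter c[NTHREADS] = {{0}};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000L; i++)
            c[id].value++;         /* private cache line: no false sharing */
    }

    long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += c[i].value;
    printf("total = %ld\n", total);
    return 0;
}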

Distributed memory systems require communication among the node computers. Such communication is affected by several factors, including network bandwidth, network latency and communication latency. Network bandwidth is the number of bits that can be transmitted per unit time. Network latency is the time needed to make a message transfer through the network. Communication latency is the total time needed to send the message, including software overhead and interface delays. Generally, communication is expensive compared with processor work.

If a problem can be divided into small tasks that are completely independent and require no or very little synchronization and communication, the problem is called ``embarrassingly parallel''. Clearly, embarrassingly parallel problems are particularly suitable for parallel computing.
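A typical embarrassingly parallel example is Monte Carlo estimation of $ \pi$: the samples are completely independent, so the loop parallelizes with essentially no communication. The following OpenMP sketch in C assumes a POSIX system (for rand_r); the sample count and seeds are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n    = 10000000L;   /* number of random points */
    long       hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();  /* per-thread seed */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;
            double y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;            /* point fell inside the quarter circle */
        }
    }
    printf("pi is approximately %f\n", 4.0 * hits / (double) n);
    return 0;
}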

