Two important parts of computer hardware are the processor, which performs computations, and memory, in which programs and data are stored. A processor is often also called a central processing unit (CPU). Modern computer systems adopt a stored-program architecture: all program instructions are stored in memory together with the data they process, and are executed sequentially by a processor according to those instructions.
In a traditional single processor computer, a single stream of instructions is generated from the program, and these instructions operate on a single stream of data. [10] called this arrangement a single instruction stream-single data stream (SISD) computer.
On the other hand, a parallel computer system uses several processors, and is realized as a single instruction stream-multiple data stream (SIMD) computer or a multiple instruction stream-multiple data stream (MIMD) computer. SIMD refers to a form of parallel execution in which all processors execute the same operation on different data at the same time, and is often associated with performing the same operation on every element of a vector or array. MIMD refers to parallel execution in which each processor works independently; for example, one processor might update a database file while another processor handles a graphic display.
The fundamental software of a modern computer system is the operating system, such as UNIX or Microsoft Windows. Operating systems support multiple users and multiple tasks, even on single processor systems, by adopting a time-slicing mechanism, in which the processor executes tasks cyclically. In parallel computer systems, some tasks are additionally executed on different processors simultaneously.
The traditional computer system has a single processor (or CPU) that can access all of the memory (Fig. 8.1). Parallel computers use more than one processor simultaneously for a single calculation task. There are two simple methods to increase the number of available processors in a single system. One method is to add processors to the traditional single processor system without changing other parts. Because all the memory is shared by all processors, such systems are called shared memory systems (Fig. 8.2). An example of a shared memory system is a dual processor personal computer, where the motherboard has two sockets for CPUs. When we mount one CPU, it works as a traditional single processor system. If we mount two CPUs, both processors can access all the memory in the PC, and it works as a shared memory system. A second method is to connect traditional single processor computers by a network. This is called a distributed memory system, because the memory is used by a single processor locally and is ''distributed'' over the whole system (Fig. 8.3). An example of a distributed memory system is a network of workstations, in which each node computer works independently and communicates with the others through a network to solve a single problem.
Integration of shared memory and distributed memory is possible (Fig. 8.4). Network-connected PCs that each have two processors can be considered a distributed shared memory system.
In the simple shared memory realization, all the processors
can access all the memory at the same speed by using a common
memory bus. This is known as a uniform memory access
(UMA)
configuration. Performance in a UMA system is limited by the
memory bus bandwidth; adding processors to the system beyond
some point does not increase performance linearly, because
signals from processors flow on the same memory bus and often
cause collisions. Typically, UMA configurations do not scale
well beyond to
processors.
To improve communication between processors and memory, a non-uniform memory access (NUMA) configuration can be used. In NUMA systems, all processors have access to all the memory, but the cost of accessing a specific location in memory differs between processors, because different regions of memory are on physically different buses. Even with a NUMA configuration, it is not efficient to use more than a moderate number of processors in a shared memory system.
A shared memory system in which any processor can perform any piece of work equally well is also called a symmetric multiprocessor (SMP) system.
In a shared memory system, a single copy of the operating system is in charge of all the processors and the memory. Programs for such systems usually adopt a programming model called ''fork-join''. Each program begins by executing just a single task, called the master. When the first parallel work is reached, the master spawns (or forks) additional tasks (called slaves or workers), which ''join'' the master when they finish their work (the middle figure in Fig. 8.5). Such activities can be programmed using software technologies such as processes, threads or OpenMP, which are explained in the next section.
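The fork-join pattern can be sketched in Python, using threads as a stand-in for the process, thread or OpenMP technologies mentioned above (the function and variable names here are illustrative, not part of any particular library):

```python
import threading

def worker(task_id, data, results):
    # Each worker (slave) task computes the sum of squares of its slice.
    results[task_id] = sum(x * x for x in data)

def fork_join_sum_of_squares(values, n_workers=4):
    """Master forks n_workers tasks, joins them, and combines the results."""
    chunk = (len(values) + n_workers - 1) // n_workers
    results = [0] * n_workers
    threads = []
    for i in range(n_workers):      # fork: spawn the worker tasks
        t = threading.Thread(
            target=worker,
            args=(i, values[i * chunk:(i + 1) * chunk], results))
        t.start()
        threads.append(t)
    for t in threads:               # join: wait until every worker finishes
        t.join()
    return sum(results)             # master combines the partial results
```

For example, `fork_join_sum_of_squares(list(range(10)))` forks four workers, joins them, and returns the combined result, 285.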
In a distributed memory system, each node is an independent computer with, at least, its own processor and memory, and the nodes are connected together by a network. Such a so-called ''network of workstations'' (NOW) is the cheapest way to construct a distributed memory system, because we can utilize many kinds of workstations already available and connected by a network, without adding any new hardware. However, a NOW is sometimes ineffective for heavy computation: for example, general purpose networks are slow, and nodes may unexpectedly be used for other work, so that it is difficult to schedule them efficiently.
Nowadays, ''Beowulf class cluster computers'' are popular for distributed memory parallel computing ([34]). These are a kind of NOW, but with some differences. First, the nodes in the cluster are the same kind of workstation or PC, and are dedicated to the cluster's calculation tasks. Typically, the node computers share a working directory on the hard disk and have no display or keyboard. The interconnection network is isolated from external networks and is likewise dedicated to the cluster, so communication among the nodes requires no further authentication. Operating system parameters are tuned to improve the total performance for parallel computing. All these characteristics improve the performance of parallel computing on the cluster.
Distributed memory systems have no memory bus problem. Each processor can use the full bandwidth to its own local memory without interference from other processors. Thus, there is no inherent limit to the number of processors. The size of the system is constrained only by the network used to connect the node computers. Some distributed memory systems consist of several thousand processors.
As nodes in a distributed memory system share no memory at all, exchange of information among processors is more difficult than in a shared memory system. We usually adopt a message passing programming model on a distributed memory system; we organize a program as a set of independent tasks that communicate with each other via messages. This introduces two sources of overhead: it takes time to construct and send a message from one processor to another, and the receiving processor must be interrupted to deal with messages from other processors.
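As a rough sketch of this model, the following Python code imitates message passing between independent tasks using threads and queues; each ''node'' keeps its data local and communicates only through messages. A real distributed memory program would use a library such as MPI instead, and all names here are illustrative:

```python
import queue
import threading

def node(rank, inbox, outbox):
    # Each "node" owns its local data; nothing is shared directly.
    local_data = list(range(rank * 5, rank * 5 + 5))
    partial = sum(local_data)
    outbox.put((rank, partial))   # send a message to the master
    inbox.get()                   # wait for the master's "done" message

def message_passing_sum(n_nodes=3):
    """Master collects one message per node and broadcasts a reply."""
    to_master = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    workers = [threading.Thread(target=node, args=(r, inboxes[r], to_master))
               for r in range(n_nodes)]
    for w in workers:
        w.start()
    total = 0
    for _ in range(n_nodes):      # receive one message from each node
        rank, partial = to_master.get()
        total += partial
    for box in inboxes:           # broadcast "done" to all nodes
        box.put("done")
    for w in workers:
        w.join()
    return total
```

Note that every exchange of information costs an explicit send and receive, which is exactly the overhead described above.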
Widely available message passing libraries are PVM and MPI. The right figure in Fig. 8.5 shows an execution image of MPI. High Performance Fortran (HPF) is also used mainly on distributed memory systems. These technologies are described in the next section.
We expect the calculation speed to increase $p$-fold if we use $p$ processors instead of one. We also wish to use multiprocessor systems just like an ordinary single processor system. However, some costs are incurred in realizing parallel computing. They include the non-parallel characteristics of the problem, communication costs such as distributing and gathering data and/or programs, the difficulty of programming for synchronization among executions, and unexpected influences of cache memory. All these factors reduce the effect of parallelization.

All programming tasks include non-parallelizable or serial parts, which cannot be executed on several processors; examples are summarizing calculation results and writing them to the display or a file. Assume the ratio of computing time for the serial parts to the whole task is $f$ ($0 \le f \le 1$). If a single processor requires time $T$ to complete the task, $(1-f)T$ of the computation time is used for the parallelizable part and $fT$ for the serial part. If we use $p$ processors, the elapsed time for the parallelizable part will be at least $(1-f)T/p$, while the execution time of the serial part remains $fT$. Thus, the ratio of the execution time for one processor to that for $p$ processors, which is called the speedup factor, is at most

$$S(p) = \frac{T}{fT + (1-f)T/p} = \frac{p}{1 + (p-1)f}.$$
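This bound on the speedup is easy to evaluate numerically; a minimal sketch (the function name is ours):

```python
def amdahl_speedup(p, f):
    """Maximum speedup with p processors when a fraction f of the
    single-processor execution time is inherently serial."""
    return 1.0 / (f + (1.0 - f) / p)
```

For example, with a serial fraction f = 0.1, ten processors give a speedup of only about 5.26, and even arbitrarily many processors cannot exceed 1/f = 10.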
Of course, as $f$ goes to zero, $S(p)$ converges to $p$, which is the ideal situation.
Amdahl's law considers the situation where the task size is fixed and the number of processors increases. In real problems, however, we wish to perform larger tasks when the number of processors increases. For example, assume time $s$ is required for preparing a task and time $r$ is required for the (moderate) simulation task itself. When a parallel computer is available, we wish to perform more simulations: typically, a simulation $p$ times larger than the original one on $p$ processors. To perform this larger simulation, a single processor system requires time $s + pr$, while the $p$-processor system requires time $s + r$. The speedup factor is

$$S(p) = \frac{s + pr}{s + r}.$$
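This scaled speedup can be computed in the same way (again, the function name and the numbers in the example are ours):

```python
def scaled_speedup(p, s, r):
    """Speedup when the problem is grown p-fold along with the number of
    processors: serial preparation time s, per-problem simulation time r."""
    return (s + p * r) / (s + r)
```

With s = r = 1, ten processors yield a scaled speedup of 5.5, and, unlike the fixed-size case, the speedup keeps growing as the problem is enlarged further.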
If we divide one task into several small tasks and execute them in parallel, we must wait until all the child tasks have been completed: we must synchronize executions. As the slowest child task determines the total execution time, child tasks should be designed to have almost the same execution times, otherwise some processors may be idle while others have tasks queuing for execution. Techniques that aim to spread tasks among the processors equally are called load balancing and are not easy.
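The effect of load balancing can be shown with a small numerical sketch: the elapsed time of a synchronized parallel step is the maximum, over processors, of the total time of the tasks assigned to each processor (the task durations below are invented for illustration):

```python
def parallel_elapsed(assignments):
    # Each processor runs its assigned tasks in sequence; the step
    # finishes only when the slowest processor finishes.
    return max(sum(tasks) for tasks in assignments)

# Six tasks with durations 5, 5, 5, 1, 1, 1 shared by two processors.
unbalanced = [[5, 5, 5], [1, 1, 1]]   # elapsed time 15; one processor idles
balanced = [[5, 5], [5, 1, 1, 1]]     # elapsed time 10, near the ideal 9
```

Spreading the long and short tasks evenly cuts the elapsed time from 15 to 10, even though the total amount of work is unchanged.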
In a shared memory system, exchange of information among processors is performed through variables stored in the shared memory. If several tasks use one variable almost simultaneously, it may cause trouble. Consider two tasks, A and B, each trying to decrease the value of a variable $x$ by one. Assume $x = 2$; task A obtains this value, decreases it and writes $1$ into $x$. If task B tries to do the same before task A finishes its work, task B also obtains the value $2$, and writes $1$ into $x$. Thus, the final result is $1$, although $x$ should have been decreased twice, to $0$. To avoid such a maloperation, task B must wait until task A finishes. All parallel computing software can handle this synchronization problem, typically by using a lock-unlock mechanism.
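The lost update and its lock-based cure can be sketched in Python; because a genuine race is timing-dependent, the bad interleaving of the two tasks is replayed deterministically here (class and method names are illustrative):

```python
import threading

class Counter:
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

    def decrement_unsafe(self):
        # Unprotected read-modify-write: two tasks interleaving here
        # can both read the same value, so one decrement is lost.
        v = self.value
        self.value = v - 1

    def decrement_safe(self):
        with self.lock:    # lock-unlock: only one task may enter at a time
            self.value -= 1

# Deterministic replay of the bad interleaving described in the text:
c = Counter(2)
v_a = c.value        # task A reads 2
v_b = c.value        # task B reads 2 before A has written
c.value = v_a - 1    # task A writes 1
c.value = v_b - 1    # task B also writes 1: the final value is 1, not 0
```

Two calls to `decrement_safe`, by contrast, always leave the counter at 0, because the second task waits until the first releases the lock.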
An important hardware aspect of shared memory systems is cache memory. Because advances in main memory technology have not kept up with processor innovations, memory access is very slow compared with processor speed. To mitigate this problem, another layer of memory, called the cache, is placed between a processor and main memory. It is a small amount of very fast, expensive memory that operates close to the speed of the processor. A separate cache controller monitors memory accesses and loads data and instructions in blocks of contiguous locations from memory into the cache. Once the contents of memory are in the cache, the processor can operate on them at full speed. When the cache does not hold the data the processor needs, the processor stalls and has to wait while the necessary data is loaded from memory into the cache. This mechanism works well in a single processor system.
All processors in a shared memory system have their own caches. Suppose several processors access the same location of memory and copy its contents into their caches. If one processor then changes the value at that location, the other processors should no longer use the values in their caches. A cache coherence protocol propagates this information among the caches. A common cache coherence protocol is an invalidate policy: when one copy of a memory location is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache). In shared memory systems, cache coherence is maintained in hardware, so the programmer need not manage it explicitly. However, it may slow down the calculation. Note that caches handle blocks of memory: if one processor writes to one part of a block, copies of the whole block in other caches are invalidated even though the data actually written is not shared. This is known as false sharing and can damage cache performance in a shared memory system. To achieve adequate performance, we are sometimes required to write programs that take the size of the cache memory into account.
Distributed memory systems require communication among node computers. Such communication is affected by several factors, including network bandwidth, network latency and communication latency. Network bandwidth is the number of bits that can be transmitted in unit time. Network latency is the time needed for a message to traverse the network. Communication latency is the total time to send the message, including software overhead and interface delays. Generally, communication is expensive compared with processor work.
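A common first-order model of message cost combines a fixed start-up latency with a term proportional to message size; a minimal sketch (the numbers in the comment are invented for illustration):

```python
def transfer_time(message_bits, latency_s, bandwidth_bps):
    """Time to deliver one message: fixed start-up latency plus
    transmission time (message size divided by bandwidth)."""
    return latency_s + message_bits / bandwidth_bps

# A 1 Mbit message on a 100 Mbit/s link with 50 microsecond latency
# takes about 0.01005 s; for short messages the latency term dominates.
```

This is why many small messages are far more expensive than one large message carrying the same data.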
If a problem can be divided into small tasks that are completely independent and require no or very little synchronization and communication, the problem is called ''embarrassingly parallel''. Clearly, embarrassingly parallel problems are particularly suitable for parallel computing.
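An embarrassingly parallel workload can be sketched with a Python thread pool; each run is independent, so no synchronization is needed beyond collecting the results (`simulate` is a placeholder for real per-run work):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Placeholder for one independent simulation run; no run needs
    # any data from another run.
    return (seed * 2654435761) % 1000

def run_all(seeds, max_workers=4):
    # Embarrassingly parallel: map the same function over independent
    # inputs and collect the results in their original order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate, seeds))
```

Because the runs never communicate, the speedup of such problems is limited mainly by the serial work of distributing inputs and gathering outputs.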