## 11.1 Parallel Architectures

The construction of parallel computers can vary in each of the following three ways:

1. Control mechanism
2. Address-space organization
3. Interconnection network

### 11.1.1 Control Mechanism

Each processor in a parallel computer can operate either under the control of a centralized control unit or independently under the control of its own control unit. The first kind of architecture is called ***single instruction stream, multiple data stream*** (SIMD). [Figure 11.3](#ch11fig03)(a) illustrates an SIMD architecture. The interconnection network depicted in the figure represents the hardware that enables the processors to communicate with each other. Interconnection networks are discussed in [Section 11.1.3](#ch11lev2sec5). In an SIMD architecture, the same instruction is executed synchronously by all processing units under the control of the central unit. Not all processors must execute an instruction in each cycle; any given processor can be switched off in any given cycle.

[![Click To expand](https://box.kancloud.cn/24d63440e138947e2a41867e14088e0a_350x233.jpg)](fig11-3_0.jpg)
Figure 11.3: (a) A single instruction stream, multiple data stream (SIMD) architecture. (b) A multiple instruction stream, multiple data stream (MIMD) architecture.

Parallel computers in which each processor has its own control unit are called ***multiple instruction stream, multiple data stream*** (MIMD) computers. [Figure 11.3](#ch11fig03)(b) illustrates an MIMD architecture. MIMD computers store both the operating system and the program at each processor.

SIMD computers are suited to programs in which the same set of instructions is executed on different elements of a data set. Such programs are called ***data parallel programs***. A drawback of SIMD computers is that they cannot execute different instructions in the same cycle. For example, suppose the following conditional statement is being executed:

```
if (x == y)
    execute instructions A;
else
    execute instructions B;
```

Any processor that finds *x* ≠ *y* (recall that the processors are processing different data elements) must do nothing while the processors that find *x* = *y* are executing instructions A. Those that find *x* = *y* must then be idle while the others are executing instructions B. In general, SIMD computers are best suited to parallel algorithms that require synchronization. Many MIMD computers have extra hardware that provides synchronization, which means that they can emulate SIMD computers.

### 11.1.2 Address-Space Organization

Processors can communicate with each other either by modifying data in a common address space or by passing messages. The address space is organized differently according to the communication method used.

### Shared-Address-Space Architecture

In a ***shared-address-space architecture***, the hardware provides for read and write access by all processors to a shared address space. Processors communicate by modifying data in the shared address space. [Figure 11.4](#ch11fig04)(a) depicts a shared-address-space architecture in which the time it takes each processor to access any word in memory is the same. Such a computer is called a ***uniform memory access*** (UMA) computer. In a UMA computer, each processor may have its own private memory, as shown in [Figure 11.4](#ch11fig04)(a). This private memory is used only to hold local variables necessary for the processor to carry out its computations. None of the actual input to the algorithm is in the private area.
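The text does not tie the shared-address-space model to any particular programming interface. As a rough illustration only, the following minimal sketch assumes POSIX threads standing in for processors (that choice is not part of the text): the two threads communicate simply by modifying a variable that lives in the common address space.

```
#include <pthread.h>
#include <stdio.h>

/* Shared address space: both "processors" (threads here) can read and
   write this variable directly. */
static int shared_sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds its portion of the input to the shared sum. */
static void *add_portion(void *arg) {
    int value = *(int *)arg;
    pthread_mutex_lock(&lock);      /* synchronize access to the shared data */
    shared_sum += value;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    int a = 3, b = 4;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_portion, &a);
    pthread_create(&t2, NULL, add_portion, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_sum = %d\n", shared_sum);   /* prints 7 */
    return 0;
}
```

The point of the sketch is only that communication happens through the shared variable; no message is ever sent between the two threads.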
A drawback of a UMA computer is that the interconnection network must simultaneously provide access for every processor to the shared memory. This can significantly slow down performance. An alternative is to provide each processor with a portion of the shared memory, as illustrated in [Figure 11.4](#ch11fig04)(b). This memory is not private, as is the local memory in [Figure 11.4](#ch11fig04)(a). That is, each processor has access to the memory stored at another processor. However, it has faster access to its own memory than to memory stored at another processor. If most of a processor's accesses are to its own memory, performance should be good. Such a computer is called a ***nonuniform memory access*** (NUMA) computer.

[![Click To expand](https://box.kancloud.cn/93e46efe2dc13997fb0d4e859b60a098_350x250.jpg)](fig11-4_0.jpg)
Figure 11.4: (a) A uniform memory access (UMA) computer. (b) A nonuniform memory access (NUMA) computer.

### Message-Passing Architecture

In a ***message-passing architecture***, each processor has its own private memory that is accessible only to that processor. Processors communicate by passing messages to other processors rather than by modifying data elements. [Figure 11.5](#ch11fig05) shows a message-passing architecture. Notice that [Figure 11.5](#ch11fig05) looks much like [Figure 11.4](#ch11fig04)(b). The difference is in the way in which the interconnection network is wired. In the case of the NUMA computer, the interconnection network is wired to allow each processor access to the memory stored at the other processors, whereas in the message-passing computer it is wired to allow each processor to send a message directly to each of the other processors.

[![Click To expand](https://box.kancloud.cn/33c57dffc0dc6aa81d93be58b276d637_298x501.jpg)](fig11-5_0.jpg)
Figure 11.5: A message-passing architecture. Each processor's memory is accessible only to that processor. Processors communicate by passing messages to each other through the interconnection network.
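Again, the text does not name a concrete message-passing interface. The sketch below uses MPI purely as an illustrative stand-in (an assumption, not something the text specifies): process 0 communicates with process 1 by sending a message rather than by writing to shared memory. It would be run with two or more processes, for example with `mpirun -np 2`.

```
#include <mpi.h>
#include <stdio.h>

/* Minimal send/receive between two processes: process 0 sends a value
   to process 1, which receives it into its own private memory. */
int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```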
### 11.1.3 Interconnection Networks

There are two general categories of interconnection networks: static and dynamic. Static networks are typically used to construct message-passing architectures, whereas dynamic networks are typically used to construct shared-address-space architectures. We discuss each of these types of networks in turn.

### Static Interconnection Networks

A ***static interconnection network*** contains direct links between processors; such networks are sometimes called ***direct networks***. There are several different types of static interconnection networks. Let's discuss some of the most common ones.

The most efficient, but also the most costly, is a ***completely connected network***, which is illustrated in [Figure 11.6](#ch11fig06)(a). In such a network, every processor is directly linked to every other processor. Therefore, a processor can send a message to another processor directly on the link to that processor. Because the number of links is quadratic in terms of the number of processors, this type of network is quite costly.

In a ***star-connected network***, one processor acts as the central processor. That is, every other processor has a link only to that processor. [Figure 11.6](#ch11fig06)(b) depicts a star-connected network. In a star-connected network, a processor sends a message to another processor by sending the message to the central processor, which in turn routes the message to the receiving processor.

[![Click To expand](https://box.kancloud.cn/dafeeaa1ca49227f61c6ca7a9d39b5ac_350x108.jpg)](fig11-6_0.jpg)
Figure 11.6: (a) A completely connected network. (b) A star-connected network. (c) A bounded-degree network of degree 4.

In a ***bounded-degree network*** of degree *d*, each processor is linked to at most *d* other processors. [Figure 11.6](#ch11fig06)(c) depicts a bounded-degree network of degree 4. In a bounded-degree network of degree 4, a message can be passed by sending it first along one direction and then along the other direction until it reaches its destination.

A slightly more complex, but popular, static network is the hypercube. A zero-dimensional hypercube consists of a single processor. A one-dimensional hypercube is formed by linking the processors in two zero-dimensional hypercubes. A two-dimensional hypercube is formed by linking each processor in a one-dimensional hypercube to one processor in another one-dimensional hypercube. Recursively, a (*d* + 1)-***dimensional hypercube*** is formed by linking each processor in a *d*-dimensional hypercube to one processor in another *d*-dimensional hypercube. A given processor in the first hypercube is linked to the processor occupying the corresponding position in the second hypercube. A *d*-dimensional hypercube therefore contains 2^*d* processors, each linked to *d* others. [Figure 11.7](#ch11fig07) illustrates hypercube networks.

[![Click To expand](https://box.kancloud.cn/07e784b7d883cfadb5683f4dd91c33d2_350x192.jpg)](fig11-7_0.jpg)
Figure 11.7: Hypercube networks.

It should be clear that the reason why static networks are ordinarily used to implement message-passing architectures is that the processors in such networks are directly linked, enabling the flow of messages.
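The recursive construction above is consistent with the standard numbering in which the 2^*d* processors of a *d*-dimensional hypercube are labeled 0 through 2^*d* − 1 and two processors are linked exactly when their labels differ in a single bit. The text itself does not assign labels, so the numbering in the following sketch is an assumption; the sketch simply lists each processor's neighbors under that labeling.

```
#include <stdio.h>

/* Neighbors in a d-dimensional hypercube, assuming processors are
   labeled 0 .. 2^d - 1 and two processors are linked exactly when
   their labels differ in a single bit (a labeling convention that is
   consistent with, but not stated in, the text's construction). */
int main(void) {
    int d = 3;                       /* dimension of the hypercube   */
    int p = 1 << d;                  /* number of processors = 2^d   */

    for (int i = 0; i < p; i++) {
        printf("processor %d is linked to:", i);
        for (int k = 0; k < d; k++)
            printf(" %d", i ^ (1 << k));   /* flip bit k to get a neighbor */
        printf("\n");
    }
    return 0;
}
```

For *d* = 3 this prints the 8 processors of the three-dimensional hypercube, each with exactly 3 neighbors, matching the construction described above.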
### Dynamic Interconnection Networks

In a ***dynamic interconnection network***, processors are connected to memory through a set of switching elements. One of the most straightforward ways to do this is to use a ***crossbar switching network***. In such a network, *p* processors are connected to *m* memory banks using a grid of switching elements, as shown in [Figure 11.8](#ch11fig08). If, for example, processor3 is currently allowed to access membank2, the switch at the grid position circled in [Figure 11.8](#ch11fig08) is closed (closing the switch completes the circuit, enabling the flow of electricity). This network is called "dynamic" because the connection between a processor and a memory bank is made dynamically when a switch is closed.

A crossbar switching network is nonblocking. That is, the connection of one processor to a given memory bank does not block the connection of another processor to another memory bank. Ideally, in a crossbar switching network there should be a bank for every word in memory. However, this is clearly impractical. Ordinarily, the number of banks is at least as large as the number of processors, so that, at a given time, every processor is capable of accessing at least one memory bank. The number of switches in a crossbar switching network is equal to *pm*. Therefore, if we require that *m* be greater than or equal to *p*, the number of switches is greater than or equal to *p*^2. As a result, crossbar switching networks can become quite costly when the number of processors is large.

[![Click To expand](https://box.kancloud.cn/b96318e50e9f4a785cfe4634a262d009_350x210.jpg)](fig11-8_0.jpg)
Figure 11.8: A crossbar switching network. There is a switch at every position on the grid. The circled switch is closed, enabling the flow of information between processor3 and membank2, when the third processor is currently allowed to access the second memory bank.

Other dynamic interconnection networks, which will not be discussed here, include bus-based networks and multistage interconnection networks. It should be clear why dynamic interconnection networks are ordinarily used to implement shared-address-space architectures: in such networks each processor is allowed to access every word in memory but cannot send a direct message to any of the other processors.

The introduction to parallel hardware presented here is based largely on the discussion in Kumar, Grama, Gupta, and Karypis (1994). The reader is referred to that text for a more thorough introduction and, in particular, for a discussion of bus-based and multistage interconnection networks.
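To make the cost estimate for crossbar switching networks concrete, the short sketch below tabulates the switch count *pm* for a few processor counts, taking *m* = *p*, the smallest number of banks the discussion above allows; the particular values of *p* are arbitrary and only illustrative.

```
#include <stdio.h>

/* Switch count p*m for a crossbar with p processors and m memory banks,
   using m = p (the smallest value permitted when we require m >= p).
   The processor counts listed here are illustrative, not from the text. */
int main(void) {
    long processor_counts[] = {4, 16, 64, 256, 1024};
    for (int i = 0; i < 5; i++) {
        long p = processor_counts[i];
        long m = p;                               /* take m = p          */
        printf("p = %4ld   switches = p*m = %ld\n", p, p * m);
    }
    return 0;
}
```

The quadratic growth is apparent: at 1,024 processors the crossbar already needs more than a million switches, which is why the text calls such networks costly for large numbers of processors.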