Computer Science Question

please refer to attachments.


Textbooks Required:

  1. A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.
  2. Distributed Systems, 4th Edition, Version 4.01. Author(s): Maarten van Steen, Andrew S. Tanenbaum. Publisher: CreateSpace; 3.01 edition (January 2023). ISBN: 978-90-815406-3-6 (printed version).

CS476
Parallel and distributed systems
Module 1
Introduction to Parallel and Distributed Computing.
Contents
1. Introduction: scope, issues, applications.
2. Challenges of Parallel and Distributed Computing.
3. Implicit Parallelism: Trends in Microprocessor Architectures.
4. Dichotomy of Parallel Computing Platforms.
5. Physical Organization, co-processing.
Weekly Learning Outcomes
1. What is parallel or distributed computing and why is it necessary?
2. Explain the scope, applications, and challenges of parallel and distributed computing.
3. Understand the communication models of parallel platforms.
Required Reading
Chapter 1 Introduction to Parallel and Distributed Computing: (A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.)
Recommended Reading:
Introduction to parallel and distributed processing:

Implicit parallelism https://www.youtube.com/watch?v=9_uX7L7-7ME
https://www.cs.rochester.edu/users/faculty/sandhya/csc258/
What is Parallel Computing:
Parallel computing is the process of running an application or computation on multiple processors simultaneously. In general, it refers to a type of computing architecture in which large problems are divided into separate, smaller, typically related parts that can be processed all at once. Multiple CPUs work together, exchanging information through shared memory, and the partial results are then combined. Parallel computing facilitates the execution of complex computations by distributing an enormous problem among multiple processors.
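As a minimal illustration of this idea, here is a sketch in C with OpenMP that splits one large computation (summing an array) across multiple processors sharing memory; the array size and contents are made up for the example:

    #include <omp.h>
    #include <stdio.h>

    /* Minimal sketch: a large problem (summing an array) divided into smaller
       parts that several processors work on at once, combining results through
       shared memory. */
    #define N 1000000
    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  /* each thread sums a chunk; partial sums are combined */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }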
What is Distributed Computing:
In distributed computing, a user sees a single system that is actually made up of several autonomous computers. There is no shared memory in distributed systems, and machines communicate with one another by exchanging messages. A single task is split up among several computers in distributed computing.
Types of parallel computing
1. Bit-level parallelism: the form of parallel computing based on the processor word size. When performing a task on large-sized data, it reduces the number of instructions the processor must execute; otherwise the operation has to be split into a series of instructions. For example, suppose an 8-bit processor must perform an operation on 16-bit numbers. It must first operate on the 8 lower-order bits and then on the 8 higher-order bits, so two instructions are needed to execute the operation. A 16-bit processor can perform the operation with a single instruction.
2. Instruction-level parallelism: the processor decides, for a single CPU clock cycle, how many instructions are executed at the same time; with instruction-level parallelism, a processor can work on more than one instruction during each clock cycle.
3. Task parallelism: the form of parallelism in which a task is decomposed into subtasks. Each subtask is then allocated for execution, and the subtasks are executed concurrently by different processors.
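A small sketch of task parallelism using OpenMP sections, where two hypothetical subtasks run concurrently on different processors; the subtask bodies are placeholders, not part of the original material:

    #include <stdio.h>
    #include <omp.h>

    /* Two independent (hypothetical) subtasks executed as concurrent tasks. */
    void subtask_a(void) { printf("subtask A on thread %d\n", omp_get_thread_num()); }
    void subtask_b(void) { printf("subtask B on thread %d\n", omp_get_thread_num()); }

    int main(void) {
        #pragma omp parallel sections
        {
            #pragma omp section
            subtask_a();        /* may run on one thread ... */
            #pragma omp section
            subtask_b();        /* ... while this runs on another */
        }
        return 0;
    }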
Parallel computing
Observation
High-performance distributed computing started
with parallel computing
Multiprocessor and multicore versus multicomputer
Moore’s law (attributed to Gordon Moore, Founder of Intel): Number
of transistors doubles every 2 years
Leveraging Moore’s Law
More transistors – opportunities for
exploiting parallelism
Implicit parallelism
Pipelining
Superscalar
Explicit parallelism
Streaming and multimedia processor
extensions
– E.g., MMX, Altivec
Very long instruction words (VLIW)
Uniprocessor Limits
The power problem!
http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005
S.No. | Parallel Computing | Distributed Computing
1 | Many operations are performed simultaneously | System components are located at different locations
2 | A single computer is required | Uses multiple computers
3 | Multiple processors perform multiple operations | Multiple computers perform multiple operations
4 | It may have shared or distributed memory | It has only distributed memory
5 | Processors communicate with each other through a bus | Computers communicate with each other through message passing
6 | Improves the system performance | Improves system scalability, fault tolerance and resource sharing capabilities
Applications of Parallel Computing
There are various applications of Parallel Computing, which are as
follows:
1. One of the primary applications of parallel computing is Databases
and Data mining.
2. The real-time simulation of systems is another use of parallel
computing.
3. Technologies such as networked video and multimedia.
4. Science and Engineering.
5. Collaborative work environments.
6. The concept of parallel computing is used by augmented reality,
advanced graphics, and virtual reality.
Applications of Distributed computing:
Social networks, mobile systems, online
banking, and online gaming (e.g. multiplayer
systems) also use efficient distributed
systems. Additional areas of application for
distributed computing include e-learning
platforms, artificial intelligence, and e-commerce.
What are the challenges of parallel and distributed
computing?
Important concerns are workload sharing, which attempts to take advantage of
access to multiple computers to complete jobs faster; task migration, which
supports workload sharing by efficiently distributing jobs among machines; and
automatic task replication, which occurs at different sites for greater
reliability.
Heterogeneity: The Internet enables users to access services and run applications
over a heterogeneous collection of computers and networks. …
Transparency: …
Openness. …
Concurrency. …
Security. …
Scalability. …
Failure Handling.
Distributed Systems Issues
• The lack of global knowledge
• Naming
• Scalability: size, geographic, administrative
• Compatibility
• Process synchronization (requires global knowledge)
• Resource management (requires global knowledge)
• Reliability/fault tolerance: fail-stop, Byzantine (arbitrary) failure models
Problems:
• Distributed consensus
• Replication, caching consistency
• Security and trust

Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and a datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses each of these components in significant ways. Different applications utilize different aspects of parallelism: e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance. It is important to understand each of these performance bottlenecks.
Implicit Parallelism:
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made a large number of transistors available.
-The question of how best to utilize these resources is an important
one.
-Current processors use these resources in multiple functional units
and execute multiple instructions in the same cycle.
-The precise manner in which these instructions are selected and
executed provides impressive diversity in architectures.
Very Long Instruction Word (VLIW) Processors
The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
Dichotomy of Parallel Computing Platforms:

An explicitly parallel program must specify concurrency and
interaction between concurrent subtasks.

The former is sometimes also referred to as the control structure
and the latter as the communication model.
Control Structure of Parallel Programs





Parallelism can be expressed at various levels of granularity – from instruction
level to processes.
Between these extremes exist a range of models, along with corresponding
architectural support.
Processing units in parallel computers either operate under the centralized
control of a single control unit or work independently.
If there is a single control unit that dispatches the same instruction to
various processors (that work on different data), the model is referred to as
single instruction stream, multiple data stream (SIMD).
If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD-MIMD Comparison

Some of the earliest parallel computers such as the
Illiac IV, MPP, DAP, CM-2, and MasPar MP-1
belonged to this class of machines.

Variants of this concept have found use in coprocessing units such as the MMX units in Intel
processors and DSP chips such as the Sharc.

SIMD relies on the regular structure of
computations (such as those in image processing).

It is often necessary to selectively turn off
operations on certain data items. For this reason,
most SIMD programming paradigms allow for an
“activity mask”, which determines if a processor
should participate in a computation or not.

In contrast to SIMD processors, MIMD processors
can execute different programs on different
processors.

A variant of this, called single program multiple
data streams (SPMD) executes the same program
on different processors.

It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and
underlying architectural support.

Examples of such platforms include current
generation Sun Ultra Servers, SGI Origin Servers,
multiprocessor PCs, workstation clusters, and the
IBM SP.
Communication Model of Parallel Platforms
✓There are two primary forms of data exchange between parallel
tasks – accessing a shared data space and exchanging messages.
✓Platforms that provide a shared data space are called shared
address-space machines or multiprocessors.
✓Platforms that support messaging are also called message passing platforms or multicomputers.
Shared Memory Parallel Systems
• Multiple processors can access the same memory simultaneously
• Challenges:
  – One processor's cache may contain a copy of data that was just modified by another: requires hardware support for coherence
  – Two processes' operations may be interleaved in unintuitive ways: requires hardware support and guarantees on atomicity and ordering
Shared-Memory Parallel Computer
• Shared memory: programming through threading
• Multiple processors share a pool of memory
• Problems: cache coherence
• UMA vs. NUMA architecture
• Pros: easier to program (probably)
• Cons: performance may suffer if the memory is located on distant machines; limited scalability
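As a small illustration of "programming through threading" on shared memory, here is a sketch using POSIX threads; the shared counter and the use of exactly two threads are illustrative choices, not taken from the slides:

    #include <pthread.h>
    #include <stdio.h>

    /* Two threads updating a shared counter: the data lives in one shared pool
       of memory, and a mutex coordinates the concurrent writes. */
    #define N 1000000
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&lock);    /* writes are visible to all threads, so they must be coordinated */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 2000000 */
        return 0;
    }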
Distributed-Memory Parallel Computer
• Distributed memory: programming through processes
• Explicit message passing
• Networking
• Pros: tighter control over message passing
• Cons: harder to program
Modern supercomputers are hybrids!
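A minimal sketch of the explicit message passing described above, assuming an MPI implementation is available; the ranks, tag, and the value sent are illustrative:

    #include <mpi.h>
    #include <stdio.h>

    /* Two processes with no shared memory: rank 0 explicitly sends a value,
       rank 1 explicitly receives it. */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* data must be transferred explicitly */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }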
Multicore Resource Management
Multicore systems contain many shared resources, e.g.,
memory bandwidth and cache
Problem: management of shared resources for efficiency and fairness
Distributed Memory Parallel Systems
❖Parallel systems that do not share memory
❖Software system support for communication
(point-to-point, group)
❖Data must be explicitly partitioned and transferred
when needed
❖Dynamic workload management?
Shared-Address-Space Platforms
• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this
shared-address-space.
• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms:
▪ The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.
▪ Programming these platforms is easier since reads and writes are implicitly visible to other processors.
▪ However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
▪ Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
▪ A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.

Shared-Address-Space vs. Shared Memory Machines
o It is important to note the difference between the terms
shared address space and shared memory.
o We refer to the former as a programming abstraction and to
the latter as a physical machine attribute.
o It is possible to provide a shared address space using a
physically distributed memory.
Physical Organization of Parallel Platforms:
We begin this discussion with an ideal parallel machine called Parallel Random Access
Machine, or PRAM.
Architecture of an Ideal Parallel Computer
o A natural extension of the Random Access Machine (RAM) serial
architecture is the Parallel Random Access Machine, or PRAM.
o PRAMs consist of p processors and a global memory of unbounded size
that is uniformly accessible to all processors.
o Processors share a common clock but may execute different instructions in
each cycle.
o Depending on how simultaneous memory accesses are handled, PRAMs
can be divided into four subclasses.
(continued)
• Exclusive-read, exclusive-write (EREW) PRAM.
• Concurrent-read, exclusive-write (CREW) PRAM
• Exclusive-read, concurrent-write (ERCW) PRAM.
• Concurrent-read, concurrent-write (CRCW) PRAM.
Physical Complexity of an Ideal Parallel Computer
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is
not realizable.
Thank You
CS476
Parallel and distributed systems
Module 2
Introduction & Overview -Distributed Computing.
Contents
1. Distributed versus Decentralized
2. Perspectives on distributed systems
3. What do we want to achieve? Overall design goals
4. Different Terminology: Failure, error, fault
5. Developing distributed systems: Pitfalls
Weekly Learning Outcomes
1. Describe about distributed computing overall design goals for Support
sharing of resources.
2. Understand Distribution transparency, Openness and Scalability.
Required Reading
Chapter 1 Introduction: Distributed Systems, 4th Edition, Version 4.01. Author(s): Maarten van Steen, Andrew S. Tanenbaum. Publisher: CreateSpace; 3.01 edition (January 2023). ISBN: 978-90-815406-3-6 (printed version).
Recommended Reading:

Distributed versus Decentralized
What many people state
Distributed
DecentralizedCentralized
When does a decentralized system become distributed?
Adding 1 link between two nodes in a decentralized system?
Adding 2 links between two other nodes?
In general: adding k > 0 links….?
Alternative approach
Two views on realizing distributed systems
Integrative view: connecting existing networked computer systems into a larger system.
Expansive view: an existing networked computer system is extended with additional computers.
Two definitions
A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.
Some common misconceptions
Centralized solutions do not scale
Make distinction between logically and physically centralized. The root of the Domain Name
System:
logically centralized
physically (massively) distributed
decentralized across several organizations
Centralized solutions have a single point of failure
Generally, not true (e.g., the root of DNS). A single point of failure is often:
easier to manage
easier to make more robust
Important
There are many, poorly founded, misconceptions regarding scalability, fault tolerance, security,
etc. We need to develop skills by which distributed systems can be readily understood so as to
judge such misconceptions.
Perspectives on distributed systems
Distributed systems are complex: take perspectives
Architecture: common organizations
Process: what kind of processes, and their relationships
Communication: facilities for exchanging data
Coordination: application-independent algorithms
Naming: how do you identify resources?
Consistency and replication: performance requires copies of data, which need to be the same
Fault tolerance: keep running in the presence of partial failures
Security: ensure authorized access to resources
What do we want to achieve?
Overall design goals
-Support sharing of resources
-Distribution transparency
-Openness
-Scalability
Sharing resources
Canonical examples
Cloud-based shared storage and files
Peer-to-peer assisted multimedia streaming
Shared mail services (think of outsourced mail systems)
Shared Web hosting (think of content distribution networks)
Observation
“The network is the computer”
(quote from John Gage, then at Sun Microsystems)
Distribution transparency
What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that its processes and resources are physically distributed across multiple computers, possibly separated by large distances.
Observation
Distribution transparency is handled through many different techniques in a layer between applications and operating systems: a middleware layer.
Distribution transparency: Types

Transparency | Description
Access | Hide differences in data representation and how an object is accessed
Location | Hide where an object is located
Relocation | Hide that an object may be moved to another location while in use
Migration | Hide that an object may move to another location
Replication | Hide that an object is replicated
Concurrency | Hide that an object may be shared by several independent users
Failure | Hide the failure and recovery of an object
Dependability
Basics
A component provides services to clients.
To provide services, the component may
require the services from other
components ⇒ a component may
depend on some other component.
Specifically
A component C depends on C∗ if the
correctness of C’s behavior depends on
the correctness of C∗’s behavior.
(Components are processes or channels.)
Requirements related to dependability

Requirement | Description
Availability | Readiness for usage
Reliability | Continuity of service delivery
Safety | Very low probability of catastrophes
Maintainability | How easily a failed system can be repaired
Terminology: failure, error, fault

Term | Description | Example
Failure | A component is not living up to its specifications | Crashed program
Error | Part of a component that can lead to a failure | Programming bug
Fault | Cause of an error | Sloppy programmer
Terminology: handling faults

Term | Description | Example
Fault prevention | Prevent the occurrence of a fault | Don't hire sloppy programmers
Fault tolerance | Build a component such that it can mask the occurrence of a fault | Build each component by two independent programmers
Fault removal | Reduce the presence, number, or seriousness of a fault | Get rid of sloppy programmers
Fault forecasting | Estimate current presence, future incidence, and consequences of faults | Estimate how a recruiter is doing when it comes to hiring sloppy programmers
On security
Observation
A distributed system that is not secure, is not dependable
What we need
Confidentiality: information is disclosed only to authorized parties
Integrity: Ensure that alterations to assets of a system can be made only in
an authorized way
Authorization, Authentication, Trust
Authentication: verifying the correctness of a claimed identity
Authorization: does an identified entity have proper access rights?
Trust: one entity can be assured that another will perform particular actions
according to a specific expectation
Security mechanisms
Keeping it simple
It’s all about encrypting and decrypting data using security keys.
Notation
K (data) denotes that we use key K to encrypt/decrypt data.
Symmetric cryptosystem
With encryption key EK(data) and decryption key DK(data): if data = DK(EK(data)) then DK = EK. Note: the encryption and decryption keys are the same and should be kept secret.
Asymmetric cryptosystem
Distinguish a public key PK (data) and a private (secret) key SK (data).
Security mechanisms
Secure hashing
In practice, we use secure hash functions: H(data) returns a fixed-length string.
Any change from data to data∗ will lead to a completely different string H(data∗).
Given a hash value h, it is computationally infeasible to find data with h = H(data).
Practical digital signatures
Sign message for Bob by Alice:
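A minimal sketch of the signing step in the notation used on these slides, assuming Alice has a private key SK_A with matching public key PK_A (this stands in for the missing figure and is not reproduced from the book):

    Alice sends to Bob:   m,  sig = SK_A(H(m))
    Bob checks:           PK_A(sig) ?= H(m)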
Scale in distributed systems
Observation
Many developers of modern distributed systems easily use the
adjective “scalable” without making clear why their system actually
scales.
At least three components
Number of users or processes (size scalability)
Maximum distance between nodes (geographical scalability)
Number of administrative domains (administrative scalability)
Observation
Most systems account only, to a certain extent, for size scalability. Often
a solution: multiple powerful servers operating independently in parallel.
Today, the challenge still lies in geographical and administrative
scalability.
Formal analysis
A centralized service can be modeled as a simple queuing system
Assumptions and notations
The queue has infinite capacity ⇒ arrival rate of requests is not influenced by current queue
length or what is being processed.
Arrival rate requests: λ
Processing capacity service: µ requests per second
Fraction of time having k requests in the system: pk = (1 − λ/µ)(λ/µ)^k
Formal analysis
Utilization U of a service is the fraction of time that it is busy: U = λ/µ
Average number of requests in the system: N = U / (1 − U)
Average throughput: X = U · µ = λ
Formal analysis
Response time R: the total time taken to process a request after submission: R = S / (1 − U), with S = 1/µ being the service time.
Observations
If U is small, the response-to-service ratio R/S is close to 1: a request is immediately processed.
If U goes up to 1, the system comes to a grinding halt. Solution: decrease S.
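A worked evaluation of the formulas above; the arrival and service rates are assumed example values, not numbers from the text:

    #include <stdio.h>

    /* Evaluate the queueing formulas above: U = lambda/mu, N = U/(1-U),
       S = 1/mu, R = S/(1-U). The rates are made-up example values. */
    int main(void) {
        double lambda = 90.0;        /* arriving requests per second   */
        double mu     = 100.0;       /* requests served per second     */
        double U = lambda / mu;      /* utilization            = 0.9   */
        double N = U / (1.0 - U);    /* avg. requests in system = 9    */
        double S = 1.0 / mu;         /* service time           = 10 ms */
        double R = S / (1.0 - U);    /* response time          = 100 ms */
        printf("U=%.2f  N=%.1f  S=%.3fs  R=%.3fs  R/S=%.1f\n", U, N, S, R, R / S);
        return 0;
    }

At 90% utilization each request already spends ten times its service time in the system, which is the "grinding halt" effect the observation describes.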
Problems with geographical scalability
•Cannot simply go from LAN to WAN: many distributed systems assume
synchronous client-server interactions: client sends request and waits for
an answer. Latency may easily prohibit this scheme.
•WAN links are often inherently unreliable: simply moving streaming video
from LAN to WAN is bound to fail.
•Lack of multipoint communication, so that a simple search broadcast
cannot be deployed. Solution is to develop separate naming and directory
services (having their own scalability problems).
Problems with administrative scalability
Essence
Conflicting policies concerning usage (and thus payment), management, and security
Examples
Computational grids: share expensive resources between different domains.
Shared equipment: how to control, manage, and use a shared radio telescope
constructed as large-scale shared sensor network?
Exception: several peer-to-peer networks
File-sharing systems (based, e.g., on BitTorrent)
Peer-to-peer telephony (early versions of Skype)
Peer-assisted audio streaming (Spotify)
Note: end users collaborate and not administrative entities.
Techniques for scaling
Facilitate solution by moving computations to client
Techniques for scaling
Partition data and computations across multiple machines
Move computations to clients (Java applets and scripts)
Decentralized naming services (DNS)
Decentralized information systems (WWW)
Replication and caching: Make copies of data available at different machines
Replicated file servers and databases
Mirrored Websites
Web caches (in browsers and proxies)
File caching (at server and client)
Scaling: The problem with replication
Applying replication is easy, except for one thing
Having multiple copies (cached or replicated), leads to inconsistencies:
modifying one copy makes that copy different from the rest.
Always keeping copies consistent and in a general way requires global
synchronization on each modification.
Global synchronization precludes large-scale solutions.
Observation
If we can tolerate inconsistencies, we may reduce the need for global
synchronization, but tolerating inconsistencies is application dependent.
Parallel computing
Observation
High-performance distributed computing started with parallel
computing
Multiprocessor and multicore versus multicomputer
Distributed shared memory systems
Observation
Multiprocessors are relatively easy to program in comparison to multicomputers, yet have
problems when increasing the number of processors (or cores). Solution: Try to
implement a shared-memory model on top of a multicomputer.
Example through virtual-memory techniques
Map all main-memory pages (from different processors) into one single virtual address
space. If a process at processor A addresses a page P located at processor B, the OS
at A traps and fetches P from B, just as it would if P had been located on local disk.
Problem
Performance of distributed shared memory could never compete with that of
multiprocessors and failed to meet the expectations of programmers. It has been widely
abandoned by now.
Cluster computing
Essentially a group of high-end systems connected through a LAN
Homogeneous: same OS, near-identical hardware
Single, or tightly coupled managing node(s)
Grid computing
The next step: plenty of nodes from everywhere
Heterogeneous
Dispersed across several organizations
Can easily span a wide-area network
Note
To allow for collaborations, grids generally use virtual organizations.
In essence, this is a grouping of users (or better: their IDs) that
allows for authorization on resource allocation.
Architecture for grid computing
The layers
Fabric: Provides interfaces to local
resources (for querying state and
capabilities, locking, etc.)
Connectivity: Communication/transaction
protocols, e.g., for moving data between
resources. Also various authentication
protocols.
Resource: Manages a single resource,
such as creating processes or reading
data.
Collective: Handles access to multiple
resources: discovery, scheduling,
replication.
Application: Contains actual grid
applications in a single organization.
Distributed pervasive systems
Observation
Emerging next-generation of distributed systems in which nodes are
small, mobile, and often embedded in a larger system, characterized by
the fact that the system naturally blends into the user’s environment.
Three (overlapping) subtypes
Ubiquitous computing systems: pervasive and continuously present, i.e.,
there is a continuous interaction between system and user.
Mobile computing systems: pervasive, but emphasis is on the fact that
devices are inherently mobile.
Sensor (and actuator) networks: pervasive, with emphasis on the actual
(collaborative) sensing and actuation of the environment.
Ubiquitous systems
Core elements
(Distribution) Devices are networked, distributed, and accessible
transparently
(Interaction) Interaction between users and devices is highly unobtrusive
(Context awareness) The system is aware of a user’s context to optimize
interaction
(Autonomy) Devices operate autonomously without human intervention, and
are thus highly self-managed
(Intelligence) The system as a whole can handle a wide range of dynamic
actions and interactions
Mobile computing
Distinctive features
A myriad of different mobile devices (smartphones, tablets, GPS devices,
remote controls, active badges).
Mobile implies that a device’s location is expected to change over time ⇒
change of local services, reachability, etc. Keyword: discovery.
Maintaining stable communication can introduce serious problems.
For a long time, research has focused on directly sharing resources
between mobile devices. It never became popular and is by now
considered to be a fruitless path for research.
Bottomline
Mobile devices set up connections to stationary servers, essentially bringing
mobile computing in the position of clients of cloud-based services.
Mobile computing
Mobile cloud computing
Mobile edge computing
Sensor networks
Characteristics
The nodes to which sensors are attached are:
Many (10s-1000s)
Simple (small memory/compute/communication
capacity)
Often battery-powered (or even battery-less)
Sensor networks as distributed databases
Two extremes
The cloud-edge continuum
Developing distributed systems: Pitfalls
Observation
Many distributed systems are needlessly complex, caused by mistakes that required patching
later on. Many false assumptions are often made.
False (and often hidden) assumptions
1. The network is reliable
2. The network is secure
3. The network is homogeneous
4. The topology does not change
5. Latency is zero
6. Bandwidth is infinite
7. Transport cost is zero
8. There is one administrator
CS476
Parallel and distributed systems
Module 3
Principles of Parallel Algorithm Design
Contents
1. Introduction to Parallel Algorithms: Decomposition, Tasks, and Dependency
Graphs
2. Decomposition Techniques
3. Characteristics of Tasks and Interactions
4. Mapping Techniques for Load Balancing
Weekly Learning Outcomes
1. Explain Parallel Algorithms and Decomposition Techniques.
2. Discussing Characteristics of Tasks and Interactions Dependency Graphs.
3. Understand the basic Mapping Techniques for Load Balancing.
Required Reading
Chapter 3 Principles of Parallel Algorithm Design: (A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.)
Recommended Reading:


https://www.cs.purdue.edu/homes/ayg/book/Slides/chap3_slides.pdf
Preliminaries: Decomposition, Tasks, and Dependency Graphs
The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
A given problem may be decomposed into tasks in many different ways.
Tasks may be of the same, different, or even indeterminate sizes.
A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.
Steps in the Parallelization
• Decomposition into tasks
  – Expose concurrency
• Assignment to processes
  – Balancing load and maximizing locality
• Orchestration
  – Name and access data
  – Communicate (exchange) data
  – Synchronization among processes
• Mapping
  – Assignment of processes to processors
Decomposition into Tasks


Many different decompositions possible
– Tasks may be independent or have dependencies
requiring ordering
– Tasks may execute identical or different code
– Tasks may take the same or different amounts of time
Tasks and dependencies may be abstracted into a task
dependency DAG with nodes as tasks, edges as control
dependence
Granularity of Task Decompositions
• Task size (granularity) versus number of tasks
• Example: dense matrix-vector multiply
  – Fine grain: each task computes an individual element of y; large number of tasks
  – Coarse grain: each task computes multiple elements of y; smaller number of tasks
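A sketch of the coarse-grained variant for dense matrix-vector multiplication, assuming OpenMP: with static scheduling each thread is handed a contiguous block of rows of y (the function and array layout are illustrative, not from the textbook):

    #include <omp.h>

    /* Coarse-grained decomposition: each thread computes a contiguous block of
       entries of y = A*x (A is n x n, row-major). */
    void matvec(int n, const double *A, const double *x, double *y) {
        #pragma omp parallel for schedule(static)   /* one block of rows per thread */
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[(long)i * n + j] * x[j];
            y[i] = sum;
        }
    }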
Decomposition Techniques:
Exploratory Decomposition:
• Unlike data and recursive decomposition, where the decomposition is fixed/static from the design, exploratory decomposition goes hand-in-hand with its execution
• Exploration (search) of a state space of solutions: the problem decomposition reflects the shape of the execution
• Examples:
  – discrete optimization, e.g., 0/1 integer programming
  – theorem proving
  – game playing
Example:
Solve a 15 puzzle
Sequence of three moves from state (a) to final state (d)
Solving a 15 puzzle
Search
generate successor states of the current state
explore each as an independent task
Exploratory Decomposition Speedup:
Solve a 15 puzzle
The decomposition behaves according to the parallel
formulation – May change the amount of work done
Speculative Decomposition

Dependencies between tasks are not known a priori, making it impossible to identify independent tasks up front.
• Two approaches
– Conservative approaches, which identify independent tasks only
when they are guaranteed to not have dependencies
• May yield little concurrency
– Optimistic approaches, which schedule tasks even when they may
potentially be inter-dependent
• Roll-back changes in case of an error
Discrete event simulation
• Centralized time-ordered event list
– you get up -> get ready -> drive to work -> work -> eat lunch -> work some more -> drive back -> eat dinner -> and sleep
• Simulation
–extract next event in time order
–process the event
–if required, insert new events into the event list
• Optimistic event scheduling
–assume outcomes of all prior events
–speculatively process next event
–if assumption is incorrect, roll back its effects and
continue
Simulation of a network of nodes
• Simulate network behavior for various input and
node delays – The input are dynamically changing
• Thus task dependency is unknown
Speculative vs Exploratory:
Exploratory decomposition
– The output of the multiple tasks from a branch is unknown
– The parallel program may perform more, less, or the same amount of work as the serial program
• Speculative decomposition
– The input at a branch leading to multiple parallel tasks is unknown
– The parallel program performs more than or the same amount of work as the serial algorithm
Use multiple decomposition techniques together
• One decomposition may not be optimal for concurrency
– Quicksort's recursive decomposition limits concurrency (Why?)
Combined recursive and data decomposition for MIN
Characteristics of Tasks
• Theory
  – Decomposition: to parallelize theoretically; the concurrency available in a problem
• Practice
  – Task creation, interactions, and mapping to PEs; realizing concurrency practically
• Characteristics of tasks and task interactions impact the choice and performance of parallelism
• Characteristics of tasks:
  – Task generation strategies
  – Task sizes (the amount of work, e.g., FLOPs)
  – Size of data associated with tasks
Task Generation
• Static task generation
  – Concurrent tasks and the task graph are known a priori (before execution)
  – Typically obtained using recursive or data decomposition
  – Examples: matrix operations, graph algorithms, image processing applications, and other regularly structured problems
• Dynamic task generation
  – Computations formulate concurrent tasks and the task graph on the fly
  – Not explicit a priori, though high-level rules or guidelines are known
  – Typically by exploratory or speculative decompositions; also possible by recursive decomposition, e.g., quicksort
  – A classic example: game playing (15-puzzle board)
Task Sizes/Granularity
• The amount of work → the amount of time to complete (e.g., FLOPs, number of memory accesses)
• Uniform: often obtained by even data decomposition, i.e., regular
• Non-uniform: e.g., quicksort, depending on the choice of pivot
Size of Data Associated with Tasks:
• May be small or large compared to the task sizes; what matters is how it relates to the input and/or output data sizes
• Examples:
  – size(input) < size(computation), e.g., 15-puzzle
  – size(input) = size(computation) > size(output), e.g., min
  – size(input) = size(output) < size(computation), e.g., sort
• Considering the effort to reconstruct the same task context:
  – Small data: small effort: the task can easily migrate to another process
  – Large data: large effort: ties the task to a process
• Context reconstruction vs. communicating: it depends

Characteristics of Task Interactions:
• Aspects of interactions:
  – What: shared data or synchronizations, and sizes of the media
  – When: the timing
  – Who: with which task(s), and the overall topology/patterns
  – Do we know details of the above three before execution?
  – How: involve one or both? The implementation concern, implicit or explicit
• Orthogonal classification:
  – Static vs. dynamic
  – Regular vs. irregular
  – Read-only vs. read-write
  – One-sided vs. two-sided
• Static interactions
  – Partners and timing (and more) are known a priori
  – Relatively simpler to code into programs
• Dynamic interactions
  – The timing or the interacting tasks cannot be determined a priori
  – Harder to code, especially using explicit interaction
• Example of regular static interaction: image processing algorithms (dithering, edge detection); nearest-neighbor interactions on a 2D mesh
• Example of irregular static interaction: sparse matrix-vector multiplication
• Example task-interaction graph (sparse matrix-vector multiplication):
  – Tasks: each task computes an entry of y[]
  – Assign the i-th row of A to Task i; also assign b[i] to Task i
• Read-only interactions: tasks only read data items associated with other tasks
• Read-write interactions: tasks read, as well as modify, data items associated with other tasks; harder to code; require additional synchronization primitives to avoid read-write and write-write ordering races

Mapping Techniques for Load Balancing
• Static and dynamic mapping
• Once the program is decomposed and the characteristics of tasks and interactions are identified, assign a large number of concurrent tasks to an equal or relatively small number of processes for execution (though often we do a 1:1 mapping)
• Goal of mapping: minimize overheads
  – There is a cost to doing parallelism: interactions and idling (serialization)
  – Contradicting objectives: interactions vs. idling
  – Idling (serialization) ↑: insufficient parallelism
  – Interactions ↑: excessive concurrency
  – E.g., assigning all work to one processor trivially minimizes interaction at the expense of significant idling

Mapping Techniques for Minimum Idling:
• Execution: alternating stages of computation and interaction
• Mapping must simultaneously minimize idling and balance load
  – Idling means not doing useful work
  – Load balance: doing the same amount of work
  – Merely balancing load does not minimize idling
• Static or dynamic mapping:
  – Static mapping: tasks are mapped to processes a priori; needs a good estimate of task sizes; an optimal mapping may be NP-complete
  – Dynamic mapping: tasks are mapped to processes at runtime, because tasks are generated at runtime and their sizes are not known
Other factors determining the choice of mapping techniques:
• The size of data associated with a task
• The characteristics of inter-task interactions
• Even the programming models and target architectures

Schemes for Static Mapping:
• Mappings based on data partitioning (mostly 1-1 mapping)
• Mappings based on task graph partitioning
• Hybrid mappings

Mappings Based on Data Partitioning:
• Partition the computation using a combination of data decomposition and the "owner-computes" rule
• Example: 1-D block distribution of a 2-D dense matrix, with a 1-1 mapping of task/data and process
Block Array Distribution Schemes: block distribution and data sharing for dense matrix multiplication.

Cyclic and Block Cyclic Distributions:
• Consider a block distribution for LU decomposition (Gaussian elimination): the amount of computation per data item varies, so a block decomposition would lead to significant load imbalance
Block Cyclic Distributions:
• A variation of the block distribution scheme
• Partition an array into many more blocks (i.e., tasks) than the number of available processes
• Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks (an N-1 mapping of tasks to processes); a small index-mapping sketch follows at the end of this subsection
• Used to alleviate the load-imbalance and idling problems

Block Partitioning and Random Mapping: sparse matrix computations
• Load imbalance when using block-cyclic partitioning/mapping: more non-zero blocks go to the diagonal processes P0, P5, P10, and P15 than to others, while P12 gets nothing

Graph Partitioning Based Data Decomposition:
• Array-based partitioning and static mapping suit a regular domain (i.e., rectangular, mostly dense matrices) with structured and regular interaction patterns; quite effective in balancing the computations and minimizing the interactions
• Irregular domains (sparse-matrix-related; numerical simulations of physical phenomena such as cars, water/blood flow, geographic models): partition the irregular domain so as to assign an equal number of nodes to each process while minimizing the edge count of the partition

Mappings Based on Task Partitioning:
• Data partitioning: data decomposition and then a 1-1 mapping of tasks to PEs
• Task partitioning: partition a given task-dependency graph across processes
• An optimal mapping for a general task-dependency graph is an NP-complete problem, but excellent heuristics exist for structured graphs

Mapping a Binary Tree Dependency Graph:
• Map the dependency graph of quicksort to processes in a hypercube
• Hypercube: the n-dimensional analogue of a square and a cube; node numbers that differ in 1 bit are adjacent

Hierarchical/Hybrid Mappings:
• A single mapping may be inadequate, e.g., the task graph mapping of the binary tree (quicksort) cannot use a large number of processors
• Hierarchical mapping: task graph mapping at the top level, data partitioning within each level

Schemes for Dynamic Mapping:
• Also referred to as dynamic load balancing; load balancing is the primary motivation for dynamic mapping
• Dynamic mapping schemes can be centralized or distributed

Centralized Dynamic Mapping:
• Processes are designated as masters or slaves (workers; "slave" is politically incorrect)
• General strategy: the master has a pool of tasks and acts as the central dispatcher; when a worker runs out of work, it requests more work from the master
• Challenge: when the number of processes increases, the master may become the bottleneck
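A small sketch of the round-robin assignment behind a 1-D block-cyclic distribution, as referenced above; the block size, process count, and printed indices are illustrative choices:

    #include <stdio.h>

    /* 1-D block-cyclic distribution: map a global index to
       (owning process, local block, offset within the block). */
    typedef struct { int owner; int local_block; int offset; } bc_loc;

    bc_loc block_cyclic_map(int global_idx, int block_size, int num_procs) {
        bc_loc loc;
        int block = global_idx / block_size;      /* which block the element falls in      */
        loc.owner = block % num_procs;            /* blocks handed out round-robin         */
        loc.local_block = block / num_procs;      /* how many earlier blocks this process owns */
        loc.offset = global_idx % block_size;     /* position inside the block             */
        return loc;
    }

    int main(void) {
        for (int i = 0; i < 12; i++) {
            bc_loc l = block_cyclic_map(i, 2, 3); /* block size 2, 3 processes */
            printf("element %2d -> P%d (local block %d, offset %d)\n",
                   i, l.owner, l.local_block, l.offset);
        }
        return 0;
    }

Each process ends up with several non-adjacent blocks, which is exactly what alleviates the load imbalance described above.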
• Approach: chunk scheduling, where a process picks up multiple tasks at once
  – Chunk size: large chunk sizes may lead to significant load imbalances as well
  – Schemes exist to gradually decrease the chunk size as the computation progresses

Distributed Dynamic Mapping:
• All processes are created equal; each can send work to or receive work from others
• Alleviates the bottleneck in centralized schemes
• Four critical design questions:
  – How are sending and receiving processes paired together?
  – Who initiates the work transfer?
  – How much work is transferred?
  – When is a transfer triggered?
• Answers are generally application specific

CS476
Parallel and distributed systems
Module 4
Naming

Contents
1. Naming, Identifiers & Addresses
2. Chord
3. Hierarchical Location Services (HLS)
4. Security in flat naming
5. Mounting in distributed systems
6. Name-space implementation
7. Iterative name resolution
8. Scalability issues
9. Modern & Secure DNS (Domain Name Service)
10. LDAP (Lightweight Directory Access Protocol)

Weekly Learning Outcomes
1. Understand the concepts of naming and mounting in distributed systems.
2. Learn about the modern and secure Domain Name Service and the Lightweight Directory Access Protocol.

Required Reading
Chapter 6 Naming: Distributed Systems, 4th Edition, Version 4.01. Author(s): Maarten van Steen, Andrew S. Tanenbaum. Publisher: CreateSpace; 3.01 edition (January 2023). ISBN: 978-90-815406-3-6 (printed version).
Recommended Reading
http://cs.boisestate.edu/~amit/teaching/455/handouts/chap-05v2.pdf
https://www.youtube.com/watch?v=Yma7rmHqWy8

Naming, Identifiers & Addresses
Essence
Names are used to denote entities in a distributed system. To operate on an entity, we need to access it at an access point. Access points are entities that are named by means of an address.
Note
A location-independent name for an entity E is independent of the addresses of the access points offered by E.
Pure name
A name that has no meaning at all; it is just a random string. Pure names can be used for comparison only.
Identifier: a name having some specific properties (the properties of a true identifier):
1. An identifier refers to at most one entity.
2. Each entity is referred to by at most one identifier.
3. An identifier always refers to the same entity (i.e., it is never reused).
Observation
An identifier need not necessarily be a pure name, i.e., it may have content.
Addresses and identifiers are two important types of names that are each used for very different purposes. In many computer systems, addresses and identifiers are represented in machine-readable form only, that is, in the form of bit strings. For example, an Ethernet address is essentially a random string of 48 bits. Likewise, memory addresses are typically represented as 32-bit or 64-bit strings.

Types of Naming Systems
• Flat naming: the identifier is simply a random bit string. It does not contain any information whatsoever on how to locate an access point of its associated entity. Good for machines.
• Structured naming: composed of simple, human-readable names. Examples are file system naming and host naming on the Internet.
• Attribute-based naming: allows an entity to be described by (attribute, value) pairs. This allows a user to search more effectively by constraining some of the attributes.
Broadcasting
• Broadcast the ID, requesting the entity to return its current address
• Can never scale beyond local-area networks
• Requires all processes to listen to incoming location requests
Address Resolution Protocol (ARP): to find out which MAC address is associated with an IP address, broadcast the query "who has this IP address?"

Forwarding pointers
• When an entity moves, it leaves behind a pointer to its next location
• Dereferencing can be made entirely transparent to clients by simply following the chain of pointers
• Update a client's reference when the present location is found
• Geographical scalability problems (for which separate chain-reduction mechanisms are needed):
  – Long chains are not fault tolerant
  – Increased network latency at dereferencing
The principle of mobile IP (home-based approach)

Illustrative: Chord
Consider the organization of many nodes into a logical ring
• Each node is assigned a random m-bit identifier.
• Every entity is assigned a unique m-bit key.
• The entity with key k falls under the jurisdiction of the node with the smallest id ≥ k (called its successor, succ(k)); a small successor-lookup sketch follows at the end of this subsection.
Nonsolution
Let each node keep track of its neighbor and start a linear search along the ring.
Notation
We will speak of node p as the node having identifier p.
Chord lookup example: resolving key 26 from node 1 and key 12 from node 28.

Hierarchical Location Services (HLS)
Basic idea
Build a large-scale search tree for which the underlying network is divided into hierarchical domains. Each domain is represented by a separate directory node.
HLS: Tree organization
Invariants
• The address of entity E is stored in a leaf or intermediate node
• Intermediate nodes contain a pointer to a child if and only if the subtree rooted at that child stores an address of the entity
• The root knows about all entities
(Figure: storing information on an entity having two addresses in different leaf domains.)
HLS: Lookup operation
Basic principles
• Start the lookup at the local leaf node
• If the node knows about E ⇒ follow the downward pointer, else go up
• An upward lookup always stops at the root
HLS: Insert operation
(a) An insert request is forwarded to the first node that knows about entity E.
(b) A chain of forwarding pointers to the leaf node is created.

Can an HLS scale?
Observation
A design flaw seems to be that the root node needs to keep track of all identifiers ⇒ make a distinction between a logical design and its physical implementation.
Notation
• Assume there are a total of N physical hosts {H1, H2, ..., HN}. Each host is capable of running one or more location servers.
• Dk(A) denotes the domain at level k that contains address A; k = 0 denotes the root domain.
• LSk(E, A) denotes the unique location server in Dk(A) responsible for keeping track of entity E.
Basic idea for scaling
• Choose different physical servers for the logical name servers on a per-entity basis (at the root level, but also at intermediate levels)
• Implement a mapping of entities to physical servers such that the load of storing records is distributed
Solution
• Dk = {Dk,1, Dk,2, ..., Dk,Nk} denotes the Nk domains at level k; note N0 = |D0| = 1.
• For each level k, the set of hosts is partitioned into Nk subsets, with each host running a location server representing exactly one of the domains Dk,i from Dk.

Security in flat naming
Basics
Without special measures, we need to trust the name-resolution process to return what is associated with a flat name.
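A small sketch of the naive successor lookup succ(k) described above (the linear-search "nonsolution", not Chord's finger-table routing); the node identifiers are illustrative values, not the book's figure:

    #include <stdio.h>

    /* Naive successor lookup on a Chord-style ring: return the smallest node
       id >= key, wrapping around to the first node if no such id exists.
       Node ids are assumed sorted. */
    int succ(const int *nodes, int n, int key) {
        for (int i = 0; i < n; i++)
            if (nodes[i] >= key)
                return nodes[i];
        return nodes[0];                                  /* wrap around the ring */
    }

    int main(void) {
        int nodes[] = {1, 4, 9, 11, 14, 18, 20, 21, 28};  /* example ring of m-bit ids */
        int n = sizeof nodes / sizeof nodes[0];
        printf("succ(26) = %d\n", succ(nodes, n, 26));    /* 28 */
        printf("succ(12) = %d\n", succ(nodes, n, 12));    /* 14 */
        return 0;
    }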
Two approaches to follow:
• Secure the identifier-to-entity association
• Secure the name-resolution process
Self-certifying names
Use a value derived from the associated entity and make it (part of) the flat name:
• id(entity) = hash(data associated with the entity) when dealing with read-only entities, otherwise
• id(entity) = public key(entity), in which case additional data is returned, such as a verifiable digital signature.
Securing the name-resolution process
Much more involved: discussion deferred until discussing secure DNS.

Name space
Naming graph
A graph in which a leaf node represents a (named) entity. A directory node is an entity that refers to other nodes.
(Figure: a general naming graph with a single root node.)
Note
A directory node contains a table of (node identifier, edge label) pairs.
We can easily store all kinds of attributes in a node:
• Type of the entity
• An identifier for that entity
• Address of the entity's location
• Nicknames
• ...
Note
Directory nodes can also have attributes, besides just storing a directory table with (identifier, label) pairs.

Name resolution
Problem
To resolve a name, we need a directory node. How do we actually find that (initial) node?
Closure mechanism: the mechanism to select the implicit context from which to start name resolution
• www.distributed-systems.net: start at a DNS name server
• /home/maarten/mbox: start at the local NFS file server (possibly a recursive search)
• 0031 20 598 7784: dial a phone number
• 77.167.55.6: route the message to a specific IP address

Name linking
Hard link
What we have described so far as a path name: a name that is resolved by following a specific path in a naming graph from one node to another.
Soft link: allow a node N to contain a name of another node
• First resolve N's name (leading to N)
• Read the content of N, yielding a name
• Name resolution continues with that name
Observations
• The name resolution process determines that we read the content of a node, in particular, the name in the other node that we need to go to.
• One way or the other, we know where and how to start name resolution given a name.
(Figure: the concept of a symbolic link explained in a naming graph.)
Observation
Node n5 has only one name.

Mounting
Issue
Name resolution can also be used to merge different name spaces transparently through mounting: associating a node identifier of another name space with a node in the current name space.
Terminology
• Foreign name space: the name space that needs to be accessed
• Mount point: the node in the current name space containing the node identifier of the foreign name space
• Mounting point: the node in the foreign name space where name resolution continues
Mounting across a network requires:
1. The name of an access protocol.
2. The name of the server.
3. The name of the mounting point in the foreign name space.
Mounting in distributed systems: mounting remote name spaces through a specific access protocol.

Name-space implementation
Basic issue
Distribute the name resolution process as well as name-space management across multiple machines, by distributing nodes of the naming graph.
Distinguish three levels
• Global level: consists of the high-level directory nodes. The main aspect is that these directory nodes have to be jointly managed by different administrations.
• Administrational level: contains mid-level directory nodes that can be grouped in such a way that each group can be assigned to a separate administration.
• Managerial level: consists of low-level directory nodes within a single administration. The main issue is effectively mapping directory nodes to local name servers.
(Figure: an example partitioning of the DNS name space, including network files.)

Name-space implementation
A comparison between name servers for implementing nodes in a name space:

Item | Global | Administrational | Managerial
Geographical scale | Worldwide | Organization | Department
# Nodes | Few | Many | Vast numbers
Responsiveness | Seconds | Milliseconds | Immediate
Update propagation | Lazy | Immediate | Immediate
# Replicas | Many | None or few | None
Client-side caching? | Yes | Yes | Sometimes

Iterative name resolution
Principle
1. resolve(dir, [name1, ..., nameK]) is sent to Server0, responsible for dir.
2. Server0 resolves resolve(dir, name1) → dir1, returning the identification (address) of Server1, which stores dir1.
3. The client sends resolve(dir1, [name2, ..., nameK]) to Server1, etc.

Scalability issues
Size scalability
We need to ensure that servers can handle a large number of requests per time unit ⇒ high-level servers are in big trouble.
Solution
Assume (at least at the global and administrational level) that the content of nodes hardly ever changes. We can then apply extensive replication by mapping nodes to multiple servers, and start name resolution at the nearest server.
Observation
An important attribute of many nodes is the address where the represented entity can be contacted. Replicating nodes makes large-scale traditional name servers unsuitable for locating mobile entities.
Scalability issues
We need to ensure that the name resolution process scales across large geographical distances.
Problem
By mapping nodes to servers that can be located anywhere, we introduce an implicit location dependency.

DNS (Domain Name Service)
Essence
• Hierarchically organized name space with each node having exactly one incoming edge ⇒ edge label = node label.
• Domain: a subtree.
• Domain name: a path name to a domain's root node.
Information in a node:

Type | Refers to | Description
SOA | Zone | Holds info on the represented zone
A | Host | IP addr. of host this node represents
MX | Domain | Mail server to handle mail for this node
SRV | Domain | Server handling a specific service
NS | Zone | Name server for the represented zone
CNAME | Node | Symbolic link
PTR | Host | Canonical name of a host
HINFO | Host | Info on this host
TXT | Any kind | Any info considered useful

Modern DNS (Domain Name Service)
(Figures: the traditional organization of the implementation of DNS; the modern organization of DNS.)

Secure DNS
Basic approach
Resource records of the same type are grouped into a signed set, per zone. Examples:
• A set with all the IPv4 addresses of a zone
• A set with all the IPv6 addresses of a zone
• A set with the name servers of a zone
The public key associated with the secret key used for signing a set of resource records is added to a zone, called a zone-signing key.
Trusting the signatures
• All zone-signing keys are grouped again into a separate set, which is signed using another secret key. The public key of the latter is the key-signing key.
• The hash of the key-signing key is stored at, and signed by, the parent zone.
Building a trust chain
• Consider a single set of resource records RR, hashed with HZk and signed with SKZk
• SKZk has an associated public key ZSKk
• The (set of) ZSKk is hashed with HKk and signed with SKKk
• SKKk has an associated public key KSKk
A client can verify the signature SKZ2(HZ2(RR)) by checking whether
Mounting nested directories

LDAP (Lightweight Directory Access Protocol)
Essence:
• Directory Information Base: the collection of all directory entries in an LDAP service.
• Each record is uniquely named as a sequence of naming attributes (called a Relative Distinguished Name), so that it can be looked up.
• Directory Information Tree: the naming graph of an LDAP directory service; each node represents a directory entry.
Part of a directory information tree.

LDAP (Lightweight Directory Access Protocol)
Two directory entries having HostName as RDN:

Attribute            Value              Attribute            Value
Locality             Amsterdam          Locality             Amsterdam
Organization         VU University      Organization         VU University
OrganizationalUnit   Computer Science   OrganizationalUnit   Computer Science
CommonName           Main server        CommonName           Main server
HostName             star               HostName             zephyr
HostAddress          192.31.231.42      HostAddress          137.37.20.10

Result of search("(C=NL)(O=VU University)(OU=*)(CN=Main server)")

Drawbacks of a distributed index
Quite a few:
• A query involving k attributes requires contacting k servers.
• Imagine looking up "lastName = Smith ∧ firstName = Pheriby": the client may need to process many files, as there are so many people named "Smith."
• No (easy) support for range queries, such as "price = [1000 − 2500]."
Alternative: map all attributes to one dimension and then index.

Space-filling curves: principle
1. Map the N-dimensional space covered by the N attributes {a1, ..., aN} into a single dimension.
2. Hash values in order to distribute the 1-dimensional space among index servers.
Hilbert space-filling curve of (a) order 1, and (b) order 4.
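The slides illustrate the idea with a Hilbert curve; the sketch below (my own illustration, not from the textbook) uses the simpler Z-order (Morton) curve, which serves the same purpose of collapsing two attribute values into a single index that can then be hashed across index servers.

```python
def z_order_index(x: int, y: int, bits: int = 16) -> int:
    """Map a pair of attribute values to one dimension by interleaving
    their bits (Z-order / Morton curve). Nearby (x, y) pairs tend to get
    nearby indices, which helps with range-like queries."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        index |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return index

# Example: two attributes (say, price and weight) collapsed to one index,
# which is then distributed among index servers by hashing or partitioning.
print(z_order_index(3, 5))   # -> 39
```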
CS476
Parallel and Distributed Computing
Module 5
Analytical Modeling of Parallel Systems
168
Contents
1. Effect of Granularity on Performance
2. Scalability of Parallel Systems
3. Minimum Execution Time and Minimum Cost-Optimal Execution Time
4. Asymptotic Analysis of Parallel Programs
5. Other Scalability Metrics
169
Weekly Learning Outcomes
1. Learn about the scalability of parallel systems.
2. Discuss minimum execution time and minimum cost-optimal execution time.
170
Required Reading
Chapter 5 Analytical Modeling of Parallel Systems: (A Grama, A Gupta, G Karypis, V Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.)
Recommended Reading
Granularity in Parallel Computing
https://www.youtube.com/watch?v=AlzOErpaXE8
171

Effect of Granularity on Performance
❖ Often, using fewer processors improves the performance of parallel systems.
❖ Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
❖ A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down processors.
❖ Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
❖ The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.

Building Granularity: Example
• Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2.
• Use the parallel algorithm for n processors, except that, in this case, we think of them as virtual processors.
• Each of the p processors is now assigned n / p virtual processors.
• The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements.
• The subsequent log n - log p steps do not require any communication.

Building Granularity: Example (continued)
• The overall parallel execution time of this parallel system is Θ((n / p) log p).
• The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.

Building Granularity: Example (continued)
Can we build granularity in the example in a cost-optimal fashion?
• Each processing element locally adds its n / p numbers in time Θ(n / p).
• The p partial sums on p processing elements can be added in time Θ(log p).
A cost-optimal way of computing the sum of 16 numbers using four processing elements.

Building Granularity: Example (continued)
• The parallel runtime of this algorithm is TP = Θ(n / p + log p). (3)
• The cost is Θ(n + p log p).
• This is cost-optimal, so long as n = Ω(p log p)!

Scalability of Parallel Systems
How do we extrapolate performance from small problems and small systems to larger problems on larger configurations?
Consider three parallel algorithms for computing an n-point Fast Fourier Transform (FFT) on 64 processing elements.
A comparison of the speedups obtained by the binary-exchange, 2-D transpose and 3-D transpose algorithms on 64 processing elements with tc = 2, tw = 4, ts = 25, and th = 2.
Clearly, it is difficult to infer scaling characteristics from observations on small datasets on small machines.

Scaling Characteristics of Parallel Programs
• The efficiency of a parallel program can be written as
E = S / p = TS / (p TP), or E = 1 / (1 + To / TS). (4)
• The total overhead function To is an increasing function of p.

Scaling Characteristics of Parallel Programs
❖ For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases.
❖ The overall efficiency of the parallel program goes down. This is the case for all parallel programs.

Scaling Characteristics of Parallel Programs: Example
• Consider the problem of adding n numbers on p processing elements.
• We have seen that:
TP = n / p + 2 log p (5)
S = n / (n / p + 2 log p) (6)
E = 1 / (1 + (2 p log p) / n) (7)

Scaling Characteristics of Parallel Programs: Example (continued)
Plotting the speedup for various input sizes gives us: speedup versus the number of processing elements for adding a list of numbers.
Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.

Scaling Characteristics of Parallel Programs
❖ The total overhead function To is a function of both the problem size TS and the number of processing elements p.
❖ In many cases, To grows sublinearly with respect to TS.
❖ In such cases, the efficiency increases if the problem size is increased while keeping the number of processing elements constant.
❖ For such systems, we can simultaneously increase the problem size and the number of processors to keep the efficiency constant.
❖ We call such systems scalable parallel systems.

Scaling Characteristics of Parallel Programs
❖ Recall that cost-optimal parallel systems have an efficiency of Θ(1).
❖ Scalability and cost-optimality are therefore related.
❖ A scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.
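As a quick numerical illustration of equations (5)–(7) above (my own sketch, not from the textbook), the snippet below evaluates speedup and efficiency for the adding-n example: efficiency falls as p grows for a fixed n, and recovers as n grows for a fixed p.

```python
import math

def parallel_time(n: int, p: int) -> float:
    """T_P = n/p + 2 log p for adding n numbers on p processing elements (eq. 5)."""
    return n / p + 2 * math.log2(p)

def speedup(n: int, p: int) -> float:
    return n / parallel_time(n, p)        # eq. 6, with T_S = n

def efficiency(n: int, p: int) -> float:
    return speedup(n, p) / p              # eq. 7: E = 1 / (1 + 2 p log p / n)

# Fixed problem size: efficiency drops as p increases.
print([round(efficiency(4096, p), 2) for p in (4, 16, 64, 256)])
# Fixed p: efficiency recovers as the problem size increases.
print([round(efficiency(n, 64), 2) for n in (4096, 16384, 65536)])
```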
Isoefficiency Metric of Scalability
❖ For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down; this holds for all systems.
❖ For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

Isoefficiency Metric of Scalability
Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems.

Isoefficiency Metric of Scalability
❖ What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?
❖ This rate determines the scalability of the system. The slower this rate, the better.
❖ Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem.

Isoefficiency Metric of Scalability
• We can write the parallel runtime as
TP = (W + To(W, p)) / p. (8)
• The resulting expression for speedup is
S = W / TP = W p / (W + To(W, p)). (9)
• Finally, we write the expression for efficiency as
E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W).

Isoefficiency Metric of Scalability
• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To / W is maintained at a constant value.
• For a desired value E of efficiency,
W = (E / (1 − E)) To(W, p). (11)
• If K = E / (1 − E) is a constant depending on the efficiency to be maintained, then, since To is a function of W and p, we have
W = K To(W, p). (12)

Isoefficiency Metric of Scalability
❖ The problem size W can usually be obtained as a function of p by algebraic manipulation to keep the efficiency constant.
❖ This function is called the isoefficiency function.
❖ This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

Isoefficiency Metric: Example
❖ The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p.
❖ Substituting To by 2p log p, we get
W = K 2p log p. (13)
❖ Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).
❖ If the number of processing elements is increased from p to p', the problem size (in this case, n) must be increased by a factor of (p' log p') / (p log p) to get the same efficiency as on p processing elements.

Isoefficiency Metric: Example
Consider a more complex example where To = p3/2 + p3/4W3/4.
• Using only the first term of To in Equation 12, we get
W = Kp3/2. (14)
• Using only the second term, Equation 12 yields the following relation between W and p:
W = Kp3/4W3/4, i.e., W1/4 = Kp3/4, i.e., W = K4p3. (15)
• The larger of these two asymptotic rates determines the isoefficiency. This is given by Θ(p3).

Cost-Optimality and the Isoefficiency Function
• A parallel system is cost-optimal if and only if
pTP = Θ(W). (16)
• From this, we have:
W + To(W, p) = Θ(W) (17)
To(W, p) = O(W) (18)
• If we have an isoefficiency function f(p), then it follows that the relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.

Lower Bound on the Isoefficiency Function
• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.
• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.
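To make the isoefficiency function tangible (my own sketch, not from the textbook), the snippet below computes, for the adding-n example with To = 2p log p, how large the problem must be to hold the efficiency at a chosen target E, following W = K To with K = E / (1 − E):

```python
import math

def required_problem_size(p: int, E: float = 0.8) -> float:
    """Isoefficiency relation W = K * T_o(W, p) for T_o = 2 p log p (eq. 13)."""
    K = E / (1 - E)
    return K * 2 * p * math.log2(p)

# The required problem size grows as Theta(p log p):
for p in (4, 16, 64, 256):
    print(p, round(required_problem_size(p)))
```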
Degree of Concurrency and the Isoefficiency Function
❖ The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.
❖ If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively.

Degree of Concurrency and the Isoefficiency Function: Example
Consider solving a system of n equations in n variables by using Gaussian elimination (W = Θ(n3)).
❖ The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n2) computations.
❖ At most Θ(n2) processing elements can be kept busy at any time.
❖ Since W = Θ(n3) for this problem, the degree of concurrency C(W) is Θ(W2/3).
❖ Given p processing elements, the problem size should be at least Ω(p3/2) to use them all.

Minimum Execution Time and Minimum Cost-Optimal Execution Time
Often, we are interested in the minimum time to solution.
• We can determine the minimum parallel runtime TPmin for a given W by differentiating the expression for TP with respect to p and equating it to zero:
dTP / dp = 0. (19)
• If p0 is the value of p determined by this equation, then TP(p0) is the minimum parallel time.

Minimum Execution Time: Example
Consider the minimum execution time for adding n numbers:
TP = n / p + 2 log p. (20)
Setting the derivative with respect to p to zero, we have p = n / 2. The corresponding runtime is
TP = 2 log n. (21)
(One may verify that this is indeed a minimum by checking that the second derivative is positive.) Note that at this point, the formulation is not cost-optimal.

Minimum Cost-Optimal Parallel Time
❖ Let TPcost_opt be the minimum cost-optimal parallel time.
❖ If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally if and only if W = Ω(f(p)).
❖ In other words, for cost-optimality, p = O(f-1(W)).
❖ For cost-optimal systems, TP = Θ(W / p); therefore,
TPcost_opt = Θ(W / f-1(W)). (22)
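As a numerical check on the example above (my own sketch, not from the textbook), the snippet scans processor counts for n = 1024 and compares the best runtime it finds with the closed form p0 = n/2 and TPmin = 2 log n from the slides:

```python
import math

def parallel_time(n: int, p: int) -> float:
    """T_P = n/p + 2 log p (log base 2) for adding n numbers on p processing elements."""
    return n / p + 2 * math.log2(p)

n = 1024
best_p = min(range(1, n + 1), key=lambda p: parallel_time(n, p))

# The slides' closed form gives p0 = n/2 and T_P(p0) = 2 log n. The scan finds
# essentially the same minimum value of T_P, at a somewhat smaller p, because
# the closed-form derivation treats d(log p)/dp simply as 1/p.
print(best_p, round(parallel_time(n, best_p), 2))
print(n // 2, parallel_time(n, n // 2), 2 * math.log2(n))
```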
Asymptotic Analysis of Parallel Programs
Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4, as follows.
Comparison of four different algorithms for sorting a given list of numbers. The table shows the number of processing elements, parallel runtime, speedup, efficiency and the pTP product:

Algorithm   p        TP          S            E             pTP
A1          n2       1           n log n      (log n) / n   n2
A2          log n    n           log n        1             n log n
A3          n        √n          √n log n     (log n) / √n  n1.5
A4          √n       √n log n    √n           1             n log n

Asymptotic Analysis of Parallel Programs
❖ If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of increasing TP).
❖ In terms of efficiency, A2 and A4 are the best, followed by A3 and A1.
❖ In terms of cost, algorithms A2 and A4 are cost-optimal, while A1 and A3 are not.
❖ It is important to identify the objectives of the analysis and to use appropriate metrics!

Other Scalability Metrics
❖ A number of other metrics have been proposed, dictated by the specific needs of applications.
❖ For real-time applications, the objective is to scale up a system to accomplish a task in a specified time bound.
❖ In memory-constrained environments, metrics operate at the limit of memory and estimate performance under this problem growth rate.

Other Scalability Metrics: Scaled Speedup
❖ Scaled speedup is the speedup obtained when the problem size is increased linearly with the number of processing elements.
❖ If the scaled speedup is close to linear, the system is considered scalable.
❖ If the isoefficiency is near linear, the scaled speedup curve is close to linear as well.
❖ If the aggregate memory grows linearly in p, scaled speedup increases the problem size to fill memory.
❖ Alternatively, the size of the problem is increased subject to an upper bound on parallel execution time.

Scaled Speedup: Example
❖ The serial runtime of multiplying a matrix of dimension n x n with a vector is tcn2.
❖ For a given parallel algorithm,
TP = tcn2/p + ts log p + twn. (24)
❖ The total memory requirement of this algorithm is Θ(n2).

Scaled Speedup: Example (continued)
Consider the case of memory-constrained scaling.
• We have m = Θ(n2) = Θ(p).
• Memory-constrained scaled speedup is given by
S' = tcn2 / (tcn2/p + ts log p + twn) with n2 = c x p, or S' = O(√p).
• This is not a particularly scalable system.

Scaled Speedup: Example (continued)
Consider the case of time-constrained scaling.
❖ We have TP = O(n2/p).
❖ Since this is constrained to be constant, n2 = O(p).
❖ Note that in this case, time-constrained speedup is identical to memory-constrained speedup.
❖ This is not surprising, since the memory and time complexity of the operation are identical.

Scaled Speedup: Example
• The serial runtime of multiplying two matrices of dimension n x n is tcn3.
• The parallel runtime of a given algorithm is
TP = tcn3/p + ts log p + twn2/√p. (25)
• The speedup S is given by S = tcn3 / TP.

Scaled Speedup: Example (continued)
Consider memory-constrained scaled speedup.
• We have memory complexity m = Θ(n2) = Θ(p), or n2 = c x p.
• At this growth rate, the scaled speedup S' is given by S' = O(p).
• Note that this is scalable.

Scaled Speedup: Example (continued)
Consider time-constrained scaled speedup.
❖ We have TP = O(1) = O(n3/p), or n3 = c x p.
❖ The time-constrained speedup S'' is given by S'' = O(p5/6).
❖ Memory-constrained scaling yields better performance.

Serial Fraction f
• If the serial runtime of a computation can be divided into a totally parallel and a totally serial component, we have W = Tser + Tpar.
• From this, we have
TP = Tser + Tpar / p. (26)

Serial Fraction f
• The serial fraction f of a parallel program is defined as f = Tser / W.
• Therefore, we have TP = fW + (1 − f)W / p.

Serial Fraction
❖ Since S = W / TP, we have 1 / S = f + (1 − f) / p.
❖ From this, we have
f = (1/S − 1/p) / (1 − 1/p). (27)
❖ If f increases with the number of processors, this is an indicator of rising overhead, and thus an indicator of poor scalability.
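Equation (27) is easy to apply to measured data. The snippet below (my own sketch; the speedup figures are made up for illustration) computes the experimentally determined serial fraction and shows that a roughly constant f points to an inherent serial component rather than growing overhead:

```python
def serial_fraction(S: float, p: int) -> float:
    """Experimentally determined serial fraction, f = (1/S - 1/p) / (1 - 1/p) (eq. 27)."""
    return (1 / S - 1 / p) / (1 - 1 / p)

# Hypothetical measured speedups on 4, 8 and 16 processors:
for p, S in [(4, 3.57), (8, 6.27), (16, 10.00)]:
    print(p, round(serial_fraction(S, p), 3))
# f stays near 0.04 here; if f instead grew with p, that would signal
# rising overhead and hence poor scalability.
```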
Thank You

College of Computing and Informatics
Assignment 1
Deadline: Tuesday 01/10/2023 @ 23:59
[Total Mark for this Assignment is 8]
Student Details:
Name: ###
ID: ###
CRN: ###
Instructions:
• You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on Blackboard via the allocated folder. These files must not be in compressed format.
• It is your responsibility to check and make sure that you have uploaded both the correct files.
• Zero mark will be given if you try to bypass the SafeAssign (e.g. misspell words, remove spaces between words, hide characters, use different character sets, convert text into image or languages other than English or any kind of manipulation).
• Email submission will not be accepted.
• You are advised to make your work clear and well-presented. This includes filling your information on the cover page.
• You must use this template, failing which will result in zero mark.
• You MUST show all your work, and text must not be converted into an image, unless specified otherwise by the question.
• Late submission will result in ZERO mark.
• The work should be your own, copying from students or other resources will result in ZERO mark.
• Use Times New Roman font for all your answers.

Question One (2 Marks)
Learning Outcome(s): Recognize the fundamental principles of parallel and distributed processing, parallel system taxonomy, and parallel system performance metrics.
What are the key characteristics that differentiate a distributed system from a centralized system?

Question Two (2 Marks)
Learning Outcome(s): Recognize the fundamental principles of parallel and distributed processing, parallel system taxonomy, and parallel system performance metrics.
What are the key components of security in a distributed system, and how do they contribute to its dependability?

Question Three (4 Marks)
Learning Outcome(s): Design algorithms using Dense Matrix, Search Algorithms for Discrete Optimization Problems.
Explain the fundamental principles of parallel algorithm design. Provide examples of real-world problems where parallelism can be effectively applied.
