Data Warehousing
Data Warehousing, Lecture 24
Need for Speed: Parallelism

Virtual University of Pakistan

Ahsan Abdullah, Assoc. Prof. & Head
Center for Agro-Informatics Research, www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: [email protected]
Background
When to parallelize?

• Useful for operations that access significant amounts of data.
• Useful for operations that can be implemented independently of each other ("Divide-&-Conquer").

Parallel execution improves processing for:
• Large table scans and joins
• Creation of large indexes
• Partitioned index scans
• Bulk inserts, updates, and deletes
• Aggregations and copying
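The "Divide-&-Conquer" pattern behind these operations can be sketched in a few lines. The example below is illustrative only (the data set, worker count, and function names are assumptions, not from the lecture): a large aggregation is split into independent chunks, each chunk is processed by a separate worker, and the partial results are combined.

```python
# A minimal "Divide-&-Conquer" sketch (illustrative assumptions throughout):
# a large aggregation is split into independent chunks, each chunk is
# summed by a separate worker process, and the partial sums are combined.
from multiprocessing import Pool

def partial_sum(chunk):
    """Aggregate one independent slice of the data."""
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Divide: split the data into one slice per worker.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Conquer: workers aggregate their slices independently, in parallel.
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # Combine the partial aggregates into the final result.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum(data) == sum(data))  # parallel result matches serial
```

The same split/compute/combine shape underlies parallel scans, joins, and bulk loads; only the per-chunk work and the combine step change.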
Are you ready to parallelize?

Parallelism can be exploited if there is…
• A symmetric multi-processor (SMP), cluster, or massively parallel (MPP) system, AND
• Sufficient I/O bandwidth, AND
• Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%), AND
• Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers.
Word of caution
Parallelism can reduce system performance on over-utilized systems or systems with small I/O bandwidth.
Scalability: Size is NOT everything

Scalability also depends on:
• Number of concurrent users
• Complexity of the technique (e.g., simple table retrieval, moderate-complexity joins, propensity analysis, clustering)
• Index usage (hash-based, B-tree, multiple, bitmapped)
• Amount of detailed data
• Complexity of the data model
Scalability: Speed-Up & Scale-Up

Speed-Up: more resources mean proportionally less time for a given amount of data.

Scale-Up: if resources are increased in proportion to the increase in data size, time stays constant.
[Graphs: transactions/sec vs. degree of parallelism, and secs/transaction vs. degree of parallelism, each showing an Ideal and a Real curve]
Quantifying Speed-up

Sequential execution of three tasks (Task-1, Task-2, Task-3): 18 time units. Ideal parallel execution, including control work ("overhead"): 6 time units.

Speedup = Ts / Tm

where Ts is the time on a serial processor and Tm is the time on multiple processors.

Here: Speedup = 18 / 6 = 3, i.e., 300%.
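The metric above can be expressed directly in code. A minimal sketch, using the lecture's own figures of 18 and 6 time units:

```python
# The speed-up metric from the slide: speedup = Ts / Tm, where Ts is the
# elapsed time on a serial processor and Tm the elapsed time on multiple
# processors. The figures (18 and 6 time units) are the lecture's example.
def speedup(ts, tm):
    """Ratio of serial time to parallel time."""
    return ts / tm

s = speedup(18, 6)
print(f"speedup = {s:.0f}x = {s:.0%}")  # speedup = 3x = 300%
```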
Speed-Up & Amdahl's Law

Amdahl's Law reveals the maximum expected speedup from a parallel algorithm, given the proportion of the task that must be computed sequentially. It gives the speedup S as

S = 1 / (f + (1 - f) / N)

where f is the fraction of the problem that must be computed sequentially and N is the number of processors.

As f approaches 0, S approaches N.

Example-1: f = 5% and N = 100 gives S ≈ 16.8
Example-2: f = 10% and N = 200 gives S ≈ 9.57
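The formula and both examples can be checked with a few lines of code; a minimal sketch:

```python
# Amdahl's Law as stated above: S = 1 / (f + (1 - f) / N), where f is the
# sequential fraction and N the number of processors. The calls below
# reproduce the lecture's two examples.
def amdahl_speedup(f, n):
    """Maximum expected speedup for sequential fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)

print(round(amdahl_speedup(0.05, 100), 1))  # Example-1: 16.8
print(round(amdahl_speedup(0.10, 200), 2))  # Example-2: 9.57
```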
Amdahl's Law: Limits of parallelization

For less than 80% parallelism, the speedup drops drastically.

At 90% parallelism, 128 processors give the performance of less than 10 processors.
[Graph: speedup S (1 to 10) vs. fraction of sequential code f (0.1 to 1.0), plotted for N = 2, 4, 8, 16, 32, 64, and 128 processors]
Parallelization: OLTP vs. DSS

There is a big difference.

DSS: parallelization of a SINGLE query.

OLTP: parallelization of MULTIPLE queries, or batch updates in parallel.
Brief Intro to Parallel Processing

Parallel hardware architectures:
• Symmetric Multi-Processing (SMP)
• Distributed memory or Massively Parallel Processing (MPP)
• Non-uniform Memory Access (NUMA)

Parallel software architectures:
• Shared Memory (shared everything)
• Shared Disk
• Shared Nothing

Types of parallelism:
• Data parallelism
• Spatial parallelism
Symmetrical Multi-Processing (SMP)

A number of independent I/O units and processors, all sharing access to a single large memory space.

[Diagram: processors P1 to P4 and I/O units connected to a shared Main Memory]

• Typically each CPU executes its job independently.
• Supports both multi-tasking and parallel processing.
• Has to deal with issues such as cache coherence, processor affinity, and hot spots.
Distributed Memory Machines

• Composed of a number of self-contained, self-controlled nodes connected through a network interface.
• Each node contains its own processor, memory, and I/O.
• This architecture is better known as Massively Parallel Processing (MPP) or cluster computing.
• Memory is distributed across all nodes.

[Diagram: nodes, each with its own processor, memory, and I/O, connected by a bus, switch, or network]

• The network has a tendency to become the bottleneck.
• The issues are fundamentally different from those in SMP.
Distributed Shared Memory Machines

A little bit of both worlds!

[Diagram: four SMP nodes, each with processors P1 to P4, I/O units, and local Main Memory, connected by an Interconnection Network]
Shared Disk RDBMS Architecture

[Diagram: clients/users connected to multiple database nodes that access shared disks over an interconnect]

Advantages:
• High level of fault tolerance

Disadvantages:
• Serialization due to locking
• The interconnect can become a bottleneck
Shared Nothing RDBMS Architecture

[Diagram: clients/users connected to database nodes, each owning its own disks]

Advantages:
• Data ownership changes infrequently
• There is no locking

Disadvantages:
• Data availability is low on failure
• Data distribution must be planned very carefully
• Redistribution is expensive
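Why redistribution is expensive can be seen with a small sketch of hash partitioning, a common way of assigning data ownership in shared-nothing systems (the node counts and keys below are illustrative assumptions, not from the lecture): when the node count changes, most keys hash to a different owner and must be physically moved.

```python
# A sketch of hash partitioning in a shared-nothing design (node counts
# and keys are illustrative assumptions). Each node owns the keys that
# hash to it; changing the node count remaps most keys, which is why
# redistribution is expensive.
def owner_node(key, n_nodes):
    """Node that owns a key under simple modulo hash partitioning."""
    return hash(key) % n_nodes

keys = [f"customer-{i}" for i in range(10_000)]

# Growing the cluster from 4 to 5 nodes: count how many keys change owner.
moved = sum(1 for k in keys if owner_node(k, 4) != owner_node(k, 5))
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 nodes")
```

For uniform hashes, a key keeps its owner only when its hash agrees modulo both 4 and 5, so roughly 80% of the keys change nodes; schemes such as consistent hashing exist precisely to reduce this movement.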
Shared Disk vs. Shared Nothing RDBMS

Important note: do not confuse RDBMS architecture with hardware architecture.

• Shared-nothing databases can run on shared-everything (SMP or NUMA) hardware.
• Shared-disk databases can run on shared-nothing (MPP) hardware.