Dr. O’Neil Smith, Harris Corporation
Research Applications
Problem Statement and Applications
History of High Performance Computing
Vision for FIT Electrical & Computer Eng. Department
Current State of the Art
Classification & Ecosystem
Accelerator Technology
A ZETTABYTE IS ONE MILLION PETABYTES!
The problem
1930 • Information base 2x every 30 yrs
1970 • Information base 2x every 7 yrs
2000 • Information base 2x every 4 yrs
2012 • 5 exabytes of data every 2 days • ~450M tweets per day
~2020 • Estimated 35 zettabytes of data generated annually
“Every two days now we create as
much information as we did from the
dawn of civilization up until 2003.”
~ Eric Schmidt, Executive Chair, Google
The problem
A process of inspecting, cleaning,
transforming and modeling data
with the goal of highlighting useful
information, suggesting conclusions
and supporting decision making
The process of developing optimal or
realistic decision recommendations
based on insights derived through
the application of statistical models
and analysis against existing and/or
simulated future data
3/23/2014
Analytics: Data Mining / Predictive Analytics • Visual Analytics • Text Analytics • Search & Discovery • Statistics
Patterns of Life • Networks • Time • Geo-location • Activity • Events • Objects
The problem
IBM Blue Gene P
Deep Blue
Boise State HPC
University of Southern Cal
University of Utah
Kraken, University of Tennessee
Traditional technologies just can't keep up; they aren't fast enough. All this processing
requires a much higher-performance computing solution. This is where HPC plays a role,
accelerating processing for weather, video, image, or signal data.
BlueShark, Florida Institute of Technology
State of The Art
Complete more computations per second.
High resolution visualization.
Provides a more powerful computing resource to enhance the speed and agility of researchers for faster
insight and discovery.
Research that requires computers to run heavy processing workloads such as modeling, molecular
simulation, advanced mathematics, and computational chemistry.
Applications for these domains have to rely on compute-intensive analytics methods and
high-performance computing.
Training and simulation
On-board systems for navigation, defense, and attack
Command, control, communications
Intelligence
Surveillance
Satellite image processing
Signal and intelligence processing
HPC Industry Domains
High Performance Computing
State of The Art
1976: Cray-1, one of the most successful supercomputers in history, operating at 80 MHz with a peak performance of 250 MFLOPS at 115 kW.

1985: Cray-2, a supercomputer consisting of four vector processors with a peak performance of 1.9 GFLOPS at 115–120 kW; it remained the fastest supercomputer until 1990.

1997: Deep Blue, a chess-playing computer developed by IBM.

2002: Earth Simulator, a supercomputer built by NEC at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC), reached 131 teraFLOPS using 640 nodes, each with eight proprietary vector processing chips.

2005: Jaguar, built by Cray at Oak Ridge National Laboratory (ORNL); the massively parallel Jaguar had a peak performance of just over 1,750 teraFLOPS (1.75 petaFLOPS).

2009: Blue Gene, an IBM project aimed at designing supercomputers that can reach operating speeds in the petaFLOPS range with low power consumption.

2012: Blue Waters, a petascale supercomputer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois.
History
http://www.olcf.ornl.gov/titan/
November 2013, #2
History
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for data-parallel computation.
GPGPU
A field-programmable gate array (FPGA) is, in essence, a computer chip that can rewire itself for a given task. FPGAs can be programmed with hardware description languages such as VHDL or Verilog.
FPGA
A massively parallel processing (MPP) supercomputer usually implies a very fast proprietary interconnect that supports either distributed shared memory or even a single system image.
MPP
A symmetric multiprocessing (SMP) computer hardware architecture has two or more identical processors that are connected to a single shared main memory and controlled by a single OS instance.
SMP
A group of machines, usually with an Ethernet interconnect (read: network), each running its own separate copy of an OS, which together serve a single purpose.
Cluster
The grid can be thought of as
a distributed system with non-
interactive workloads that
involve a large number of files
Grid
HPC CLASSIFICATIONS
Class & Eco
GPGPU Programming
CUDA
OpenCL
FPGA Programming
VHDL
Verilog
SystemC
Shared Memory APIs
POSIX Threads
OpenMP
Message Passing System
Message Passing Interface
(MPI)
Class & Eco
[Diagram: HPC node architecture: a multicore Intel Core i7 host CPU (cores C) with local DDR3, QPI, and an I/O hub, connected over PCIe to an Nvidia Tesla K-series GPU. Each streaming multiprocessor holds 32 scalar processors (SP1 … SP32) with 32 kB register files and 64 KB shared memory, backed by constant and global memory over an interconnection network.]
Technology
NVIDIA created the CUDA architecture, spanning both the software stack and the underlying hardware, to achieve the massively parallel systems that we have today.
CUDA software stack:
• Applications using DirectX (HLSL compute shaders) → DirectX Compute
• Applications using OpenCL (OpenCL compute kernels) → OpenCL driver
• Applications using the CUDA driver API (C for CUDA compute kernels) → CUDA driver, with the Parallel Thread Execution (PTX) ISA
• Applications using C, C++, Fortran, Java, Python, … (C for CUDA compute kernels) → C runtime for CUDA
• CUDA support in the OS kernel
• CUDA parallel computing engines inside NVIDIA GPUs
Technology
Kepler GK110 Full chip block diagram
Traditional multicore CPU vs. GPU
Technology
Streaming Multiprocessor (SMX)
• 32 load/store units (LD/ST).
• 32 special function units (SFU)
• 64 double‐precision units
• 192 single‐precision CUDA cores
Technology
Threads per warp: 32
Max warps per SMX: 64
Max threads per SMX: 2048

Memory hierarchy:
• Local memory: private per thread
• Shared memory: per block
• Global memory: per application context, shared across grids (Grid 0, Grid 1, …)
Technology
High performance computer programs are more difficult to write
than sequential ones, because concurrency introduces several
new classes of potential software bugs:
Race conditions (most common)
Communication and synchronization
Amdahl's law bounds the maximum possible speed-up of a single program as a result of parallelization.
Research
GPU process steps: L1 point clouds → calculate the line of sight (LOS) → clip looks along the LOS → PSF LOS per look → PSF aggregate per pass.

Line of Sight Local Density Comparison (D1, D2) and Aggregate Local Density Comparison:
• Compare the density about each point in
a look using the point spread function in
a sphere (D1)
• Compare the density estimated along a
modified shape (D2) extending along the
line of sight, toward the sensor.
• This removes noise below surfaces, and
away from surfaces.
• Compare the density about
each point using the point
spread function on two
spheres of differing radii.
• This removes noise near
surfaces.
Research
CPU Kepler GPU
GPU Adapts to Data, Dynamically Launches New Threads
Research
CPU process
GPU process
Independent study by “The MITRE Corporation”, and “OG Systems”
Performance Improvement
30x to 70x faster
Research
[Figure: four LiDAR collects, LiDAR Data 1 – 4]
In collaboration with FIT ICE Lab, Co-Director Dr. Adrian Peter
Objective: Simultaneously register multiple LiDAR collects
Research
GPU Optimized Algorithms
Collect Corresponding
Point Sets
Cluster Point Sets
Calculate Curvatures at
Cluster Centers
Create Histogram based on
Distance from Center to other
Points
Match Centers Based on
Curvature and Histogram
Gaussian Clustering
Eigenvalue curvature calculations
3D Histograms
Euclidean distance matching
Research
Video Stabilization
Noise Filtering
Template Matching
Optical Flow
Track Prediction
Distributed Network Real-time Video
Tracking
GPU Optimized Algorithms
Video analytics - A wealth of image and video
data is being collected via satellite,
unmanned aerial vehicles (UAVs), and other
devices.
Performance Improvement
Before: video post-processing
After: near real-time, 28 – 33 fps
Research
Multiclass formulation: we have category basis vectors $w_1, w_2, \ldots, w_K$, collected as $W = [w_1, \ldots, w_K]$, subject to the orthonormality constraint $W^T W = I_K$.

The objective function is

$$\min_{W} E = \frac{1}{2} \sum_{k=1}^{K} w_k^T R_k w_k,$$

where

$$R_k \stackrel{\mathrm{def}}{=} \sum_{i \in C_k} x_i x_i^T, \qquad k = 1, \ldots, K,$$

and each $R_k$ is positive semi-definite.

The necessary condition: $[R_1 w_1, R_2 w_2, \ldots, R_K w_K] = W S$.

The sufficient condition: $W^T W = I$ and $S$ is symmetric.
Research
Algorithm Name | Abbr.
* Category Vector Space (Quadratic) | CQS
* Kernel Category Vector Space (Quadratic) | K-CQS
* Category Vector Space (L1 Norm) | CAS
Principal Component Analysis | PCA
* Kernel Category Vector Space (L1 Norm) | K-CAS
Least Squares Linear Discriminant Analysis | LS-LDA
Multiclass Fisher Linear Discriminant | MC-FLD
Kernel Multiclass Fisher Linear Discriminant | K-MC-FLD
Data Set Name(s) | # Classes | # Attributes
Gaussian | 3 (300) | 3
CalTech | 6 (~20) | 5400
Coil | 20 (72) | 16384
English alphabet | 13 (734 - 805) | 16
Movement libras | 15 (24) | 90
Handwritten digits | 10 (554 - 572) | 64
Poker (Four, Royal, Straight) | 3 (8, 17, 236) | 10
Satellite | 6 (626 - 1533) | 36
Segmentation | 7 (330) | 19
Vertebral | 3 (60 - 150) | 6
Shuttle | 7 (13 - 45586) | 9
Texture | 11 (250) | 40
Iris | 3 (150) | 4
Seeds | 3 (70) | 7
Wine | 3 (48, 59, 71) | 13
Vowel | 10 (90) | 10
RaceSpace (female) | 5 (65 - 635) | 10010
RaceSpace (male) | 5 (32 - 3451) | 10010
*PhD research contributions
Research
Experiments & Results: Accuracy (% ± std)

Algorithm | English alphabet | Satellite | Iris | Seeds | Wine | Vowel | Handwritten digits | Movement libras
CQS * | 87.6 ±0.8 | 86.4 ±1.0 | 97.0 ±1.3 | 95.0 ±3.0 | 84.2 ±4.4 | 85.2 ±1.7 | 96.6 ±0.5 | 86.5 ±3.5
CAS * | 87.4 ±1.1 | 86.4 ±1.2 | 96.6 ±1.1 | 95.2 ±2.7 | 84.5 ±5.0 | 85.1 ±1.9 | 96.6 ±0.8 | 87.1 ±1.6
PCA | 86.2 ±1.2 | 82.6 ±1.7 | 96.6 ±1.6 | 92.6 ±1.5 | 82.8 ±3.7 | 82.0 ±1.5 | 96.1 ±0.6 | 85.4 ±3.9
LS-LDA | 86.8 ±0.3 | 86.0 ±1.3 | 97.0 ±2.4 | 95.4 ±1.7 | 92.5 ±3.0 | 83.0 ±2.8 | 95.5 ±0.4 | 87.5 ±3.8
MC-FLD | 85.9 ±1.2 | 84.4 ±2.0 | 97.3 ±1.9 | 96.6 ±0.9 | 98.2 ±1.5 | 77.2 ±2.5 | 94.9 ±0.6 | 87.2 ±4.9
K-CQS * | 77.9 ±4.5 | 83.6 ±1.6 | 93.5 ±2.5 | 91.3 ±3.1 | 67.4 ±5.8 | 80.1 ±3.3 | 90.8 ±1.3 | 92.1 ±6.2
K-CAS * | 82.0 ±2.4 | 85.4 ±1.5 | 94.4 ±2.7 | 90.7 ±2.7 | 68.0 ±4.9 | 81.5 ±3.4 | 92.7 ±0.8 | 70.7 ±29.8
K-MC-FLD | 71.7 ±28.8 | 67.2 ±1.2 | 77.5 ±10.5 | 75.4 ±26.9 | 65.8 ±29.2 | 96.4 ±1.2 | 54.0 ±16.2 | 24.3 ±16.1
Research
Other accelerator technologies are emerging: Intel Xeon Phi coprocessor; AMD FireStream.
GPU adoption is increasing: 53 systems on the Top500 list released in Nov 2013 use GPGPUs.
GPGPUs are penetrating both high-end and mainstream HPC.
Nvidia is leading the accelerator race: hundreds of thousands of trained CUDA developers worldwide, and 50 systems on the latest Top500 list are powered by Nvidia.
Vision
Vision
Heterogeneous High Performance Computing
Teaching Center Research Center
CUDA Center of Excellence
(CCOE)
Hybrid systems benefit from the best characteristics of each (FPGA, GPU,
and CPU), including: low latency, power efficiency, attractive performance
per dollar, longer product life, customization, and the efficient use of diverse
processors.
Vision
[Diagram: hybrid HPC node: a multicore Intel Core i7 host (cores C, local DDR3, QPI, I/O hub) connected over PCIe to an Nvidia Tesla K-series GPU (streaming multiprocessors with SP1 … SP32, 32 kB register files, 64 KB shared memory, constant and global memory over an interconnection network) and, through a PCIe switch, to a Pico Computing EX-500 board carrying two M-503 modules (Virtex-6 LX240 FPGAs with BRAM, DSPs, logic cells, DDR3, and an AXI bus).]
Vision
Reconfigurable Computing
Graduate level reconfigurable computing based on advanced technology in
FPGA. Topics: reconfigurable concepts, device architecture, design tools,
metrics and kernels, application case studies.
Architectures for High Performance Computers
Design and quantitative performance analysis of high speed computer systems,
networks, and interconnect hardware/software interfaces.
High Performance Computer Programming Concepts
Surveys HPC programming language concepts and design principles of
programming paradigms (procedural, functional, and logic).
Vision
ORAU/ORNL High Performance Computing (HPC) Grant Program
National Science Foundation - High Performance Computing System
Acquisition: Continuing the Building of a More Inclusive Computing
Environment for Science and Engineering
NVIDIA Research Programs
Department of the Navy - Simulation Training and Technology Research
and Development
Department of Defense - Naval Research Laboratory – NRL
Defense Advanced Research Projects Agency (DARPA) - Big Data
National Institutes of Health (biomedical Big Data)
National Science Foundation - Big Data Research Initiative
Vision
1. Amdahl, Gene M. (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483–
485. doi:10.1145/1465482.1465560
2. Hennessy, John L.; Patterson, David A.; Larus, James R. (1999). Computer Organization and Design: The Hardware/Software Interface (2nd ed., 3rd printing). San Francisco: Kaufmann. ISBN 1-55860-428-6.
3. * Smith, O., Stark, R., Smith, P., St Romain, R., Blask, S. (2013). Point Spread Function (PSF) Noise Filter Strategy for Geiger Mode LiDAR, SPIE
4. Fouche, D. G. (2003). Detection and false-alarm probabilities for laser radars that use Geiger-mode detectors. Applied optics, 42(27), 5388-5398.
5. McIntosh, K. A., Donnelly, J. P., Oakley, D. C., Napoleone, A., Calawa, S. D., Mahoney, L. J., Shaver, D. C., (2002). InGaAsP/InP avalanche photodiodes for
photon counting at 1.06 μm. Applied Physics Letters, 81(14), 2505-2507.
6. Younger, R. D., McIntosh, K. A., Chludzinski, J. W., Oakley, D. C., Mahoney, L. J., Funk, J. E., Verghese, S. (2009, May). Crosstalk analysis of integrated
Geiger-mode avalanche photodiode focal plane arrays. In SPIE Defense, Security, and Sensing (pp. 73200Q-73200Q). International Society for Optics
and Photonics.
7. Soares, E. J., Germino, K., Glick, S. J., & Stodilka, R. Z. (2001). Determination of three-dimensional voxel sensitivity for two-and three-headed coincidence
imaging. In Nuclear Science Symposium Conference Record, 2001 IEEE (Vol. 4, pp. 2058-2061). IEEE.
8. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise.
InProceedings of the 2nd International Conference on Knowledge Discovery and Data mining (Vol. 1996, pp. 226-231). AAAI Press.
9. http://www.nvidia.com/object/tesla-servers.html. NVIDIA. [Online]
10. Leite, P., Teixeira, J. M. X. N., de Farias, T. S. M. C., Teichrieb, V., & Kelner, J. (2009, October). Massively Parallel Nearest Neighbor Queries for Dynamic Point Clouds on the GPU. In Computer Architecture and High Performance Computing, 2009. SBAC-PAD'09. 21st International Symposium on (pp. 19-25). IEEE.
11. * Smith, O. and Rangarajan, A., "Category Spaces: A New Approach to Supervised Dimensionality Reduction", Journal of Machine Learning, in preparation, 2014.
12.B. Scholkopf and A. J. Smola, Learning with Kernels Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, 2002).
13.P. Marcotte and G. Savard, "Novel Approaches To The Discrimination Problem", Mathematical Methods of Operations Research 36, 6 (1992), pp. 517--
545.
14.S. D. Bay. “Combining nearest neighbor classifiers through multiple feature subsets”, In Proceedings of the 17th International Conference on Machine
Learning, (1998), pp. 37—45.
15.C. M. Bishop. “Neural Networks for Pattern Recognition”, (1995) Oxford University Press
16. R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 7 (1936), pp. 179-188.
17. C. R. Rao and S. K. Mitra. “Generalized Inverse of Matrices and Its Applications”, (1971)
18. V. Vapnik, Statistical Learning Theory (Wiley Interscience, 1998).
19. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. (Academic Press, 1990).
20. J. Ye, "Least Squares Linear Discriminant Analysis", in International Conference on Machine learning (ACM, 2007), pp. 7
21. J. Weston and C. Watkins, "Multi-class Support Vector Machines", Department of Computer Science, Royal Holloway, University of London (1998).
22. K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines", The Journal of Machine Learning
Research 2 (2002), pp. 265-292.
23. U. Kressel, "Pairwise Classification and Support Vector Machines", in Advances in Kernel Methods (MIT Press, 1999), pp. 255-268.
24. T. Hastie and R. Tibshirani, "Classification by Pairwise Coupling", Advances in Neural Information Processing Systems 10 (1998).
25. D. Widdows and S. Peters, "Word Vectors and Quantum Logic: Experiments with Negation and Disjunction", in Mathematics of Language Conference, vol. 8 (2003), 13 pp.
26. D. Widdows, Geometry and Meaning (Center for the Study of Language and Information, 2004).
27. E. H. Rosch, "Natural Categories", Cognitive Psychology 4 (1973), pp. 328-350.
28. M. Bolla and G. Michaletzky and G. Tusnady and M. Ziermann, "Extrema of Sums of Heterogeneous Quadratic Forms", Linear Algebra and Its
Applications 269, 1-3 (1998), pp. 331-365.
29. T. Rapcsak, "On Minimization on Stiefel Manifolds", European Journal of Operational Research 143, 2 (2002), pp. 365-376.