Dr. O’Neil Smith, Harris Corporation
Research Applications
Problem Statement and Applications
History of High Performance Computing
Vision for FIT Electrical & Computer Eng. Department
Current State of the Art
Classification & Ecosystem
Accelerator Technology
A ZETTABYTE IS ONE MILLION PETABYTES!
The problem
1930 • Information base 2x every 30 yrs
1970 • Information base 2x every 7 yrs
2000 • Information base 2x every 4 yrs
2012 • 5 exabytes of data every 2 days • ~450M tweets per day
~2020 • Estimated 35 zettabytes of data generated annually
“Every two days now we create as
much information as we did from the
dawn of civilization up until 2003.”
~ Eric Schmidt, Executive Chair, Google
The problem
A process of inspecting, cleaning,
transforming and modeling data
with the goal of highlighting useful
information, suggesting conclusions
and supporting decision making
The process of developing optimal or
realistic decision recommendations
based on insights derived through
the application of statistical models
and analysis against existing and/or
simulated future data
3/23/2014
Analytics: Data Mining / Predictive Analytics • Visual Analytics • Text Analytics • Search & Discovery • Statistics
Patterns of Life • Networks • Time • Geo-location • Activity • Events • Objects
The problem
IBM Blue Gene P
Deep Blue
Boise State HPC
University of Southern Cal
University of Utah
Kraken, University of Tennessee
Traditional technologies just can't keep up; they aren't fast enough. All this processing
requires a much higher-performance computing solution. This is where HPC plays a role,
accelerating processing for weather, video, image, or signal data.
BlueShark, Florida Institute of Technology
State of The Art
Complete more computations per second.
High resolution visualization.
Provides a more powerful computing resource to enhance the speed and agility of researchers for faster
insight and discovery.
Research that requires computers to run heavy processing workloads such as modeling, molecular
simulation, advanced mathematics, and computational chemistry.
Applications for these domains have to rely on compute-intensive analytics methods and
high-performance computing.
Training and simulation
On-board systems for navigation, defense, and attack
Command, control, communications
Intelligence
Surveillance
Satellite image processing
Signal and intelligence processing
HPC Industry Domains
High Performance Computing
State of The Art
1976: Cray-1, one of the most successful supercomputers in history, operating at 80 MHz with a peak performance of 250 MFLOPS at 115 kW.

1985: Cray-2, a supercomputer consisting of four vector processors with a peak performance of 1.9 GFLOPS at 115–120 kW; it remained the fastest supercomputer until 1990.

1997: Deep Blue, a chess-playing computer developed by IBM.

2002: Earth Simulator, a supercomputer built by NEC at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC), reached 131 teraFLOPS using 640 nodes, each with eight proprietary vector processing chips.

2005: Jaguar, built by Cray at Oak Ridge National Laboratory (ORNL); the massively parallel Jaguar had a peak performance of just over 1,750 teraFLOPS (1.75 petaFLOPS).

2009: Blue Gene, an IBM project aimed at designing supercomputers that can reach operating speeds in the petaFLOPS range with low power consumption.

2012: Blue Waters, a petascale supercomputer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois.
History
http://www.olcf.ornl.gov/titan/
November 2013, #2
History
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for data-parallel computation.
GPGPU
A field-programmable gate array (FPGA) is, in essence, a computer chip that can rewire itself for a given task. FPGAs can be programmed with hardware description languages such as VHDL or Verilog.
FPGA
A massively parallel processing (MPP) supercomputer usually implies a very fast proprietary interconnect that supports either distributed shared memory or even a single system image.
MPP
A symmetric multiprocessing (SMP) computer hardware architecture has two or more identical processors that are connected to a single shared main memory and controlled by a single OS instance.
SMP
A group of machines, usually with an Ethernet interconnect (read: network), each running its own separate copy of an OS, which together serve a single purpose.
Cluster
The grid can be thought of as
a distributed system with non-
interactive workloads that
involve a large number of files
Grid
HPC CLASSIFICATIONS
Class & Eco
GPGPU Programming
CUDA
OpenCL
FPGA Programming
VHDL
Verilog
SystemC
Shared Memory APIs
POSIX Threads
OpenMP
Message Passing System
Message Passing Interface
(MPI)
Class & Eco
[Diagram: HPC node architecture: a multicore Intel Core i7 host CPU (cores C) with local DDR3, QPI, and an I/O hub, connected over PCIe to an Nvidia Tesla K-series GPU. Each streaming multiprocessor holds 32 scalar processors (SP1 … SP32) with 32 kB register files and 64 KB shared memory, backed by constant and global memory over an interconnection network.]
Technology
NVIDIA created the CUDA architecture, spanning both the software stack and the underlying hardware, to achieve the massively parallel systems that we have today.
CUDA software stack:
• Applications using DirectX (HLSL compute shaders) → DirectX Compute
• Applications using OpenCL (OpenCL compute kernels) → OpenCL driver
• Applications using the CUDA driver API (C for CUDA compute kernels) → CUDA driver, with the Parallel Thread Execution (PTX) ISA
• Applications using C, C++, Fortran, Java, Python, … (C for CUDA compute kernels) → C runtime for CUDA
• CUDA support in the OS kernel
• CUDA parallel computing engines inside NVIDIA GPUs
Technology
Kepler GK110 Full chip block diagram
Traditional multicore CPU vs. GPU
Technology
Streaming Multiprocessor (SMX)
• 32 load/store units (LD/ST).
• 32 special function units (SFU)
• 64 double‐precision units
• 192 single‐precision CUDA cores
Technology
Threads per warp: 32
Max warps per SMX: 64
Max threads per SMX: 2048

Memory hierarchy:
• Local memory: private per thread
• Shared memory: per block
• Global memory: per application context, shared across grids (Grid 0, Grid 1, …)
Technology
High performance computer programs are more difficult to write
than sequential ones, because concurrency introduces several
new classes of potential software bugs:
Race conditions (most common)
Communication and synchronization
Amdahl's law bounds the maximum possible speed-up of a single program as a result of parallelization.
Research
GPU process steps: L1 point clouds → calculate the line of sight (LOS) → clip looks along the LOS → PSF LOS per look → PSF aggregate per pass.

Line of Sight Local Density Comparison (D1, D2) and Aggregate Local Density Comparison:
• Compare the density about each point in
a look using the point spread function in
a sphere (D1)
• Compare the density estimated along a
modified shape (D2) extending along the
line of sight, toward the sensor.
• This removes noise below surfaces, and
away from surfaces.
• Compare the density about
each point using the point
spread function on two
spheres of differing radii.
• This removes noise near
surfaces.
Research
CPU Kepler GPU
GPU Adapts to Data, Dynamically Launches New Threads
Research
CPU process
GPU process
Independent study by “The MITRE Corporation”, and “OG Systems”
Performance Improvement
30x to 70x faster
Research
[Figure: four LiDAR collects, LiDAR Data 1 – 4]
In collaboration with FIT ICE Lab, Co-Director Dr. Adrian Peter
Objective: Simultaneously register multiple LiDAR collects
Research
GPU Optimized Algorithms
Collect Corresponding
Point Sets
Cluster Point Sets
Calculate Curvatures at
Cluster Centers
Create Histogram based on
Distance from Center to other
Points
Match Centers Based on
Curvature and Histogram
Gaussian Clustering
Eigenvalue curvature calculations
3D Histograms
Euclidean distance matching
Research
Video Stabilization
Noise Filtering
Template Matching
Optical Flow
Track Prediction
Distributed Network Real-time Video
Tracking
GPU Optimized Algorithms
Video analytics - A wealth of image and video
data is being collected via satellite,
unmanned aerial vehicles (UAVs), and other
devices.
Performance Improvement
Before: video post-processing
After: near real-time, 28 – 33 fps
Research
Multiclass formulation: we have category basis vectors $w_1, w_2, \ldots, w_K$, collected as $W = [w_1, \ldots, w_K]$, subject to the orthonormality constraint $W^T W = I_K$.

The objective function is

$$\min_{W} E = \frac{1}{2} \sum_{k=1}^{K} w_k^T R_k w_k,$$

where

$$R_k \stackrel{\mathrm{def}}{=} \sum_{i \in C_k} x_i x_i^T, \qquad k = 1, \ldots, K,$$

and each $R_k$ is positive semi-definite.

The necessary condition: $[R_1 w_1, R_2 w_2, \ldots, R_K w_K] = W S$.

The sufficient condition: $W^T W = I$ and $S$ is symmetric.
Research
Algorithm Name | Abbr.
* Category Vector Space (Quadratic) | CQS
* Kernel Category Vector Space (Quadratic) | K-CQS
* Category Vector Space (L1 Norm) | CAS
Principal Component Analysis | PCA
* Kernel Category Vector Space (L1 Norm) | K-CAS
Least Squares Linear Discriminant Analysis | LS-LDA
Multiclass Fisher Linear Discriminant | MC-FLD
Kernel Multiclass Fisher Linear Discriminant | K-MC-FLD
Data Set Name(s) | # Classes | # Attributes
Gaussian | 3 (300) | 3
CalTech | 6 (~20) | 5400
Coil | 20 (72) | 16384
English alphabet | 13 (734 - 805) | 16
Movement libras | 15 (24) | 90
Handwritten digits | 10 (554 - 572) | 64
Poker (Four, Royal, Straight) | 3 (8, 17, 236) | 10
Satellite | 6 (626 - 1533) | 36
Segmentation | 7 (330) | 19
Vertebral | 3 (60 - 150) | 6
Shuttle | 7 (13 - 45586) | 9
Texture | 11 (250) | 40
Iris | 3 (150) | 4
Seeds | 3 (70) | 7
Wine | 3 (48, 59, 71) | 13
Vowel | 10 (90) | 10
RaceSpace (female) | 5 (65 - 635) | 10010
RaceSpace (male) | 5 (32 - 3451) | 10010
*PhD research contributions
Research
Experiments & Results: Accuracy (% ± std)

Algorithm | English alphabet | Satellite | Iris | Seeds | Wine | Vowel | Handwritten digits | Movement libras
CQS * | 87.6 ±0.8 | 86.4 ±1.0 | 97.0 ±1.3 | 95.0 ±3.0 | 84.2 ±4.4 | 85.2 ±1.7 | 96.6 ±0.5 | 86.5 ±3.5
CAS * | 87.4 ±1.1 | 86.4 ±1.2 | 96.6 ±1.1 | 95.2 ±2.7 | 84.5 ±5.0 | 85.1 ±1.9 | 96.6 ±0.8 | 87.1 ±1.6
PCA | 86.2 ±1.2 | 82.6 ±1.7 | 96.6 ±1.6 | 92.6 ±1.5 | 82.8 ±3.7 | 82.0 ±1.5 | 96.1 ±0.6 | 85.4 ±3.9
LS-LDA | 86.8 ±0.3 | 86.0 ±1.3 | 97.0 ±2.4 | 95.4 ±1.7 | 92.5 ±3.0 | 83.0 ±2.8 | 95.5 ±0.4 | 87.5 ±3.8
MC-FLD | 85.9 ±1.2 | 84.4 ±2.0 | 97.3 ±1.9 | 96.6 ±0.9 | 98.2 ±1.5 | 77.2 ±2.5 | 94.9 ±0.6 | 87.2 ±4.9
K-CQS * | 77.9 ±4.5 | 83.6 ±1.6 | 93.5 ±2.5 | 91.3 ±3.1 | 67.4 ±5.8 | 80.1 ±3.3 | 90.8 ±1.3 | 92.1 ±6.2
K-CAS * | 82.0 ±2.4 | 85.4 ±1.5 | 94.4 ±2.7 | 90.7 ±2.7 | 68.0 ±4.9 | 81.5 ±3.4 | 92.7 ±0.8 | 70.7 ±29.8
K-MC-FLD | 71.7 ±28.8 | 67.2 ±1.2 | 77.5 ±10.5 | 75.4 ±26.9 | 65.8 ±29.2 | 96.4 ±1.2 | 54.0 ±16.2 | 24.3 ±16.1
Research
Other accelerator technologies are emerging: Intel Xeon Phi coprocessor; AMD FireStream.
GPU adoption is increasing: 53 systems on the Top500 list released in Nov 2013 use GPGPUs.
GPGPUs are penetrating both high-end and mainstream HPC.
Nvidia is leading the accelerator race: hundreds of thousands of trained CUDA developers worldwide, and 50 systems on the latest Top500 list are powered by Nvidia.
Vision
Vision
Heterogeneous High Performance Computing
Teaching Center Research Center
CUDA Center of Excellence
(CCOE)
Hybrid systems benefit from the best characteristics of each (FPGA, GPU,
and CPU), including: low latency, power efficiency, attractive performance
per dollar, longer product life, customization, and the efficient use of diverse
processors.
Vision
[Diagram: hybrid HPC node: a multicore Intel Core i7 host (cores C, local DDR3, QPI, I/O hub) connected over PCIe to an Nvidia Tesla K-series GPU (streaming multiprocessors with SP1 … SP32, 32 kB register files, 64 KB shared memory, constant and global memory over an interconnection network) and, through a PCIe switch, to a Pico Computing EX-500 board carrying two M-503 modules (Virtex-6 LX240 FPGAs with BRAM, DSPs, logic cells, DDR3, and an AXI bus).]
Vision
Reconfigurable Computing
Graduate level reconfigurable computing based on advanced technology in
FPGA. Topics: reconfigurable concepts, device architecture, design tools,
metrics and kernels, application case studies.
Architectures for High Performance Computers
Design and quantitative performance analysis of high speed computer systems,
networks, and interconnect hardware/software interfaces.
High Performance Computer Programming Concepts
Surveys HPC programming language concepts and design principles of
programming paradigms (procedural, functional, and logic).
Vision
ORAU/ORNL High Performance Computing (HPC) Grant Program
National Science Foundation - High Performance Computing System
Acquisition: Continuing the Building of a More Inclusive Computing
Environment for Science and Engineering
NVIDIA Research Programs
Department of the Navy - Simulation Training and Technology Research
and Development
Department of Defense - Naval Research Laboratory – NRL
Defense Advanced Research Projects Agency (DARPA) - Big Data
National Institutes of Health (biomedical Big Data)
National Science Foundation - Big Data Research Initiative
Vision
1. Amdahl, Gene M. (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483–
485. doi:10.1145/1465482.1465560
2. Hennessy, John L.; Patterson, David A.; Larus, James R. (1999). Computer Organization and Design: The Hardware/Software Interface (2nd ed., 3rd printing). San Francisco: Kaufmann. ISBN 1-55860-428-6.
3. * Smith, O., Stark, R., Smith, P., St Romain, R., Blask, S. (2013). Point Spread Function (PSF) Noise Filter Strategy for Geiger Mode LiDAR, SPIE
4. Fouche, D. G. (2003). Detection and false-alarm probabilities for laser radars that use Geiger-mode detectors. Applied optics, 42(27), 5388-5398.
5. McIntosh, K. A., Donnelly, J. P., Oakley, D. C., Napoleone, A., Calawa, S. D., Mahoney, L. J., Shaver, D. C., (2002). InGaAsP/InP avalanche photodiodes for
photon counting at 1.06 μm. Applied Physics Letters, 81(14), 2505-2507.
6. Younger, R. D., McIntosh, K. A., Chludzinski, J. W., Oakley, D. C., Mahoney, L. J., Funk, J. E., Verghese, S. (2009, May). Crosstalk analysis of integrated
Geiger-mode avalanche photodiode focal plane arrays. In SPIE Defense, Security, and Sensing (pp. 73200Q-73200Q). International Society for Optics
and Photonics.
7. Soares, E. J., Germino, K., Glick, S. J., & Stodilka, R. Z. (2001). Determination of three-dimensional voxel sensitivity for two-and three-headed coincidence
imaging. In Nuclear Science Symposium Conference Record, 2001 IEEE (Vol. 4, pp. 2058-2061). IEEE.
8. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise.
InProceedings of the 2nd International Conference on Knowledge Discovery and Data mining (Vol. 1996, pp. 226-231). AAAI Press.
9. http://www.nvidia.com/object/tesla-servers.html. NVIDIA. [Online]
10. Leite, P., Teixeira, J. M. X. N., de Farias, T. S. M. C., Teichrieb, V., & Kelner, J. (2009, October). Massively Parallel Nearest Neighbor Queries for Dynamic Point Clouds on the GPU. In Computer Architecture and High Performance Computing, 2009. SBAC-PAD'09. 21st International Symposium on (pp. 19-25). IEEE.
11. * Smith, O. and Rangarajan, A., "Category Spaces: A New Approach to Supervised Dimensionality Reduction", Journal of Machine Learning, in preparation, 2014.
12.B. Scholkopf and A. J. Smola, Learning with Kernels Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, 2002).
13.P. Marcotte and G. Savard, "Novel Approaches To The Discrimination Problem", Mathematical Methods of Operations Research 36, 6 (1992), pp. 517--
545.
14.S. D. Bay. “Combining nearest neighbor classifiers through multiple feature subsets”, In Proceedings of the 17th International Conference on Machine
Learning, (1998), pp. 37—45.
15.C. M. Bishop. “Neural Networks for Pattern Recognition”, (1995) Oxford University Press
16. R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 7 (1936), pp. 179-188.
17. C. R. Rao and S. K. Mitra. “Generalized Inverse of Matrices and Its Applications”, (1971)
18. V. Vapnik, Statistical Learning Theory (Wiley Interscience, 1998).
19. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. (Academic Press, 1990).
20. J. Ye, "Least Squares Linear Discriminant Analysis", in International Conference on Machine learning (ACM, 2007), pp. 7
21. J. Weston and C. Watkins, "Multi-class Support Vector Machines", Department of Computer Science, Royal Holloway, University of London (1998).
22. K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines", The Journal of Machine Learning
Research 2 (2002), pp. 265-292.
23. U. Kressel, "Pairwise Classification and Support Vector Machines", in Advances in Kernel Methods (MIT Press, 1999), pp. 255-268.
24. T. Hastie and R. Tibshirani, "Classification by Pairwise Coupling", Advances in Neural Information Processing Systems 10 (1998).
25. D. Widdows and S. Peters, "Word Vectors and Quantum Logic: Experiments with Negation and Disjunction", in Mathematics of Language Conference, vol. 8 (2003), 13 pp.
26. D. Widdows, Geometry and Meaning (Center for the Study of Language and Information, 2004).
27. E. H. Rosch, "Natural Categories", Cognitive Psychology 4 (1973), pp. 328-350.
28. M. Bolla and G. Michaletzky and G. Tusnady and M. Ziermann, "Extrema of Sums of Heterogeneous Quadratic Forms", Linear Algebra and Its
Applications 269, 1-3 (1998), pp. 331-365.
29. T. Rapcsak, "On Minimization on Stiefel Manifolds", European Journal of Operational Research 143, 2 (2002), pp. 365-376.