HPE Moonshot for HPC
HPCN Workshop 2016, Göttingen
Thomas.Bloeth@hpe.com
Sorin.Cheran@hpe.com
Apollo 8000: Supercomputing
Apollo 6000: Rack Scale HPC
Apollo 4000: Server Solutions Purpose-Built for Big Data
Apollo 2000: Enterprise Bridge to Scale-Out Compute
Big Data Workloads
HPC Workloads
Intel Mellanox NVIDIA Seagate
Platforms / Solutions / ISVs
HPE Apollo platforms and solutions optimized for HPC, IoT and Big Data
Next Gen Workloads
Moonshot
Converged Edge Data Acquisition Compute System
Video encoding
Mobile workplace
IoT
Oil and gas Life Sciences Financial Services
Manufacturing CAD/CAE
Academia Object Storage
Data Analytics
Scality
Cleversafe
Ceph
Hortonworks
Hadoop
Cloudera
Schlumberger
Paradigm
Halliburton
Gaussian
BIOVIA Redline
Synopsys
ANSYS Custom Apps
HPC Storage
Lustre BeeGFS
HPC
1. Reduces data center space
2. Reduces power required
3. Reduces complexity
... and performance and workload optimized
What is the Moonshot System Family?
Moonshot 1500
Example performance and workload optimized solutions:
• NoSQL (Apache Cassandra)
• Hadoop
• Enterprise Search Analytics
Whole solutions
Customer Value Focus
Communities of Experts
Market-defined ecosystem
• Hosted desktops
• Application delivery
• Trader & Engineering WS
• Repatriation
• Motion graphic sites
• SoMe, news sites
Big Data & Analytics
Media Processing
Mobile Workspace
HPC
Web Hosting
How Has Moonshot Been Reinvented?
1) Intel Xeon Processors
• Intel's premier processor brand, in high demand
• Integrated caches and GPUs outperform Intel E5 in many applications
2) World Class HPE Systems Management
• New management HW & SW
• Industry leading HPE iLO
3) New Chassis
• 45 server blades
• 4 server blades
• 1 server blade
4) New Moonshot Global Collaborations
• Intel Alliances
• Expanded Citrix solutions
• Virtualization with VMware
• SI partnership with Accenture
• Altair for new HPC workloads
Moonshot 1500
• Up to 180 servers, 45 server blades
• Standard 4.3U rack-mount form factor
• Integrated NW Switch and Systems Management
Moonshot 400, coming this summer
• Up to 64 Xeon cores, 4 server blades
• Standard 1U rack-mount form factor
• 4 standard PCIe slots
• Optional shock, vibration, temperature rugged
Moonshot 100, coming this summer
• 1 server blade
• Small mountable form factor
• 2 standard PCIe slots
• Optional shock, vibration, temperature rugged
What is the Moonshot System Family?
Moonshot 400 (coming soon)
Moonshot 100 (coming soon)
What is the Moonshot System Family?
Integrated NW switches
Simple cable management
Super dense, converged system
Rear view
Moonshot 1500: 45 server blades
Moonshot 400: 4 server blades (coming this summer)
Moonshot 100: 1 server blade (coming this summer)
What is the Moonshot System Family?
Moonshot 1500: 45 server blades
Moonshot 400: 4 server blades
Moonshot 100: 1 server blade
Same server blades
Server Blade Options:
m300, m350 Intel Atom
m400, m800 ARM
m700 AMD with GPU
m710p Intel Xeon with GPU
Server Blades Coming Soon:
m700p AMD with GPU
m510 Intel Xeon – 16 cores
m710x Intel Xeon with GPU
Coming this summer
CFD
Cryptography
Science Apps
Weather Forecast
Accelerated Workloads
HPC Hybrid infrastructure
The right compute for the right application … yes even in HPC
Same Application, Same Scheduling, Same Monitoring, Same Management, Same OS
1 X 1.5 X 1 X 1 X 1 X 1.3 X 2.2 X
Hybrid: something (as a power plant, vehicle, or electronic circuit) that has two different types of components performing essentially the same function; a thing made by combining two different elements; a mixture.
The first Hybrid cluster – 1st HPC and Moonshot use case
The Formula 1 story
A complete solution architecture – Not only Moonshot
The solution
• Multiple HPE platforms for specific CFD workloads
HP ProLiant m350 Servers
• To produce the largest possible number of CFD solver jobs in an 8-week period
• To have a low-power solution that would save energy costs and prepare them for the next regulations
HP ProLiant SL230s Servers
• For urgent CFD solver tasks
HP ProLiant BL460c Gen9 Servers
• To run pre- and post-processing tasks, and CFD solving jobs not related to the current racing car
Networking – What we can do in a solution
Solution 1
Moonshot Chassis #2 and #30 shown, 45 x m710p each, IRF
2 x 10G / node
2 x (3 x 40G) uplink
45x10 – 24x10 (6x40): 1.87:1
Solution 2
Moonshot Chassis #2 and #30 shown, 45 x m710p each
2 x 10G / node
4 x 40G uplink
45x10 – 16x10 (4x40): 2.81:1
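The two ratios quoted for Solution 1 and Solution 2 read like switch oversubscription figures: downlink capacity divided by uplink capacity. A minimal sketch of that arithmetic follows, assuming the slide counts 45 x 10G downlinks against 6 x 40G and 4 x 40G uplinks respectively. The function name `oversubscription` is illustrative and not from the slides.

```python
def oversubscription(downlinks: int, downlink_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Return downlink capacity divided by uplink capacity."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Solution 1: 45 x 10G downlinks, 2 x (3 x 40G) = 6 x 40G uplinks
solution_1 = oversubscription(45, 10, 6, 40)

# Solution 2: 45 x 10G downlinks, 4 x 40G uplinks
solution_2 = oversubscription(45, 10, 4, 40)

print(f"Solution 1: {solution_1:.4f}:1")
print(f"Solution 2: {solution_2:.4f}:1")
```

The exact quotients are 1.875 and 2.8125, which the slide shows truncated as 1.87:1 and 2.81:1.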
Networking – What we can do in a solution
Solution 3
Moonshot Cartridge m710p
HP ProLiant m710p Server Cartridge
CPU: Intel Xeon E3-1284L v4 "Broadwell-H" with Iris Pro P6300 GPU; Intel C220 "LynxPoint" PCH. 4 cores (8 threads), base 2.9 GHz (3.8 GHz single-core Turbo), 128 MB eDRAM (L4) shared with GPU.
GPU: 0.5 GHz (1.15 GHz Turbo). Addresses up to 16 GB of main memory (1.5 GB in Haswell-H). Enhancements in Execution Unit (EU) performance, Intel GVT (vGPU), 3D and transcoding. Support for OpenCL 1.2 (Linux), OpenCL 2.0 (Windows) and OpenGL 4.2.
MEMORY: Total of 32 GB of ECC-protected memory, dual memory channels. (4) 8 GB DDR3 1600 MHz LV non-ECC SO-DIMMs with (8) embedded DRAMs for ECC protection.
NETWORK: Integrated NIC: dual-port 10GbE Mellanox CX3 PRO with RoCE. Supported switches:
- Moonshot 45XGc: 45-port 10Gb downlinks, (4) 40GbE QSFP uplinks. Aspirational 16 SFP+.
- Moonshot 180G: 180-port 1Gb downlinks, (4) 40GbE QSFP uplinks.
STORAGE: 1 SATA x1/PCIe G2 x1 attached M.2 SSD, 120 GB or 480 GB M.2 (2280). Local SSD boot, PXE boot, iSCSI boot and storage (iSER acceleration for enabled targets).
POWER: Cartridge: <69 W
OS AND SOFTWARE: Ubuntu 15.04, 14.04.3 LTS (Aug 2015/ P1), RHEL 6.7, 7.1, CentOS 6.7, 7.2, SuSE 11.4, 12, Windows Server 2012, 2012 R2, Windows 7 x64, 8.1. Aspirational Hyper-V, XenServer.
m710p: Xeon E3 mono-socket processor with RDMA-capable 10GbE NIC
Key Features
• Memory performance: typical performance for all 4 "C" STREAM kernels: 1-core bandwidth 15 GB/s, 4-core bandwidth 21 GB/s. The cartridge offers about 5 GB/s of memory bandwidth per core, comparable to what the new Xeon Haswell cores get from dual-socket servers.
• Interconnect performance, measured with MPI latency/bandwidth between 2 servers: 2.0 µs, 1080 MB/s. The low latency between m710p cartridges results from the RDMA capabilities of the Mellanox chip and is close to IB FDR (1 µs). The asymptotic bandwidth is very close to what 10GbE can achieve, but about 6 times lower than IB FDR (6000 MB/s).
• Sequential bandwidth of local storage: 130 MB/s write, 440 MB/s read. Write performance is similar to a SATA LFF drive; read is 2.6 times faster than the SATA LFF drive.
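The interconnect claims above can be checked with simple arithmetic. The sketch below takes the measured 1080 MB/s and the quoted 6000 MB/s for IB FDR from the slide. The 10GbE line rate of 1250 MB/s is an assumption of this example: it is the raw 10 Gbit/s converted to bytes, ignoring all protocol overhead.

```python
MEASURED_MB_S = 1080.0   # MPI bandwidth between two cartridges (from the slide)
IB_FDR_MB_S = 6000.0     # IB FDR bandwidth quoted on the slide

# Assumption: raw 10 Gbit/s line rate expressed in MB/s, no protocol overhead.
LINE_RATE_MB_S = 10e9 / 8 / 1e6

fraction_of_line_rate = MEASURED_MB_S / LINE_RATE_MB_S
fdr_ratio = IB_FDR_MB_S / MEASURED_MB_S

print(f"Measured bandwidth is {fraction_of_line_rate:.1%} of the raw 10GbE line rate")
print(f"IB FDR is {fdr_ratio:.2f} times the measured bandwidth")
```

Under these assumptions the measured figure is a little over 86% of the raw line rate, and the FDR figure is roughly 5.6 times higher, which the slide rounds to 6.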
Figure: Bandwidth (MB/s) vs. message size (MB) for OpenMPI, IMPI, PMPI and MPICH3
Figure: Latency (us) for MPICH, PMPI, IMPI and OpenMPI; data labels on the chart: 1.93, 2.00, 1.91, 8.87
Stream

Stream per server (MB/s)
                               Copy        Scale       Add         Triad
m710p - 4 cores                20869.00    20968.00    21786.00    21772.00
E5-2680v3 - 24 cores           106414.00   106785.00   115396.00   115194.00
m710x - 4 cores - 2133 memory  27176.00    26761.00    25922.00    26320.00
m710x - 4 cores - 2400 memory  29350.08    28901.88    27995.76    28425.60

Stream per core (MB/s)
                               Copy        Scale       Add         Triad
m710p - 4 cores                5217.25     5242.00     5446.50     5443.00
E5-2680v3 - 24 cores           4433.92     4449.38     4808.17     4799.75
m710x - 4 cores - 2133 memory  6794.00     6690.25     6480.50     6580.00
m710x - 4 cores - 2400 memory  7337.52     7225.47     6998.94     7106.40
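The per-core figures follow from the per-server figures divided by the core count of each system. A minimal sketch of that derivation, using the values from the slide; the dictionary layout and names are illustrative only.

```python
# Per-server STREAM results in MB/s (Copy, Scale, Add, Triad), copied from the slide.
# Each entry is (core count, [Copy, Scale, Add, Triad]).
per_server = {
    "m710p": (4, [20869.00, 20968.00, 21786.00, 21772.00]),
    "E5-2680v3": (24, [106414.00, 106785.00, 115396.00, 115194.00]),
    "m710x, 2133 memory": (4, [27176.00, 26761.00, 25922.00, 26320.00]),
    "m710x, 2400 memory": (4, [29350.08, 28901.88, 27995.76, 28425.60]),
}

# Per-core bandwidth is simply per-server bandwidth divided by the core count.
per_core = {
    name: [value / cores for value in values]
    for name, (cores, values) in per_server.items()
}

for name, values in per_core.items():
    print(name, [round(v, 2) for v in values])
```

The dual-socket E5-2680v3 server delivers about five times the total bandwidth of one cartridge, but each of its 24 cores gets less than an m710p or m710x core does.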
Benchmarking clusters
Software
– RHEL7.1 + MLNX OFED
– GNU, Intel, PGI compilers
– Intel MKL, AMD ACML, OpenBLAS
– OpenMPI, MVAPICH, Intel MPI, Platform MPI
– Many monitoring/profiling tools...
Management
– HP Insight CMU with Collectl metrics
– SLURM with enhanced prologs/epilogs and many extended features
Hardware
– Moonshot m710p
– 1x E3-1284Lv4 (4 cores)
– 32 GB (4x8GB)
– L4 cache (128MB)
– 10GbE Mellanox RoCE
– 1 SSD M.2
Fast storage available as:
– NFS mounted over IB
– Lustre (DDN, Seagate/Xyratex)
– Panasas
Benchmarks vs Haswell - Applications Used
• CFD/CAE
  • Dassault Systèmes' Abaqus Explicit – finite element analysis product
  • LSTC's LS-DYNA – advanced general-purpose multiphysics simulation software package (crash)
  • ESI Group's PamCrash – crash simulation
  • CD-adapco's STAR-CCM+ – multipurpose CFD
  • Ansys's Fluent – CFD code
  • Altair's Radioss – crash simulation
  • Altair's Optistruct – structural analysis
• Science Apps
  • Molecular Dynamics: Amber, Gromacs, LAMMPS
  • Bioinformatics: Bowtie, BWA
• Cryptography
CAE Apps
Figure: results for Abaqus, Optistruct, Fluent, StarCCM, ConvergeCFD, PowerFlow, PamCrash, LS-DYNA, Radioss and Feko (lower is better)
Haswell – architecture: E5-2680 v3 (2.5 GHz)
m710p – architecture: E3-1284L v4 (2.9 GHz)
Science Apps
Figure: relative time (lower is better) for Dynamics (AMBER, GROMACS, LAMMPS), Bioinformatics (NCBI BLAST, BLASTBENCH, BOWTIE, BWA) and Quantum (Gaussian, Siesta)
XL230a Gen9 – Haswell – RHEL 6.6, architecture: E5-2698 v3 (2.3 GHz)
m710p – RHEL 6.7 Beta, architecture: E3-1284L v4 (2.9 GHz)
Cryptography (Everything is secret)

Test 1: total time (s)
              64        128       180
E5-2698v3     1043.13   427.73    310.00
E3-1284Lv4    339.76    193.83    160.55

Test 2: time used (s)
              8         64        128       180
E5-2698v3     45.15     55.61     55.12     55.21
E3-1284Lv4    38.61     40.43     40.83     40.57
Fluent - m710p vs E5-2690v4
Numbers are not in seconds but SOLVER RATINGS – the bigger the better.

Aircraft_Wing_2m (solver ratings, the higher the better)
              1        4        8        16       28       56
e5-2690v4     194.8    741.9    1408.9   2554.3   3549.3   7185
m710p         243.1    762.1    1505.9   2958.9   5316.9   9340.5

Aircraft_Wing_14m (solver ratings, the higher the better)
              1        4        8        16       28       56
e5-2690v4     22.6     84.4     161.9    288.4    389.4    762.6
m710p         28       80.3     160.8    326.6    582.6    1171.1
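One way to read the two Fluent tables is as the ratio of m710p to E5-2690v4 solver ratings in each column. The sketch below computes that ratio, assuming the column headers (1 to 56) are core counts, which the slide does not state explicitly. The names `CORES`, `RATINGS` and `rating_ratio` are illustrative.

```python
# Column headers from the slide; assumed here to be core counts.
CORES = [1, 4, 8, 16, 28, 56]

# Solver ratings copied from the slide (higher is better).
RATINGS = {
    "Aircraft_Wing_2m": {
        "e5-2690v4": [194.8, 741.9, 1408.9, 2554.3, 3549.3, 7185.0],
        "m710p": [243.1, 762.1, 1505.9, 2958.9, 5316.9, 9340.5],
    },
    "Aircraft_Wing_14m": {
        "e5-2690v4": [22.6, 84.4, 161.9, 288.4, 389.4, 762.6],
        "m710p": [28.0, 80.3, 160.8, 326.6, 582.6, 1171.1],
    },
}


def rating_ratio(case: str) -> dict[int, float]:
    """Return m710p rating divided by e5-2690v4 rating for each column."""
    data = RATINGS[case]
    return {
        cores: m710p / e5
        for cores, m710p, e5 in zip(CORES, data["m710p"], data["e5-2690v4"])
    }


ratios_2m = rating_ratio("Aircraft_Wing_2m")
ratios_14m = rating_ratio("Aircraft_Wing_14m")

for case, ratios in (("Aircraft_Wing_2m", ratios_2m), ("Aircraft_Wing_14m", ratios_14m)):
    print(case, {cores: round(r, 2) for cores, r in ratios.items()})
```

A ratio above 1 means the m710p rating is higher in that column. In the 14m case the ratio dips slightly below 1 at 4 and 8 and rises well above 1 at 28 and 56.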
TAU - checking the partitioner
Private vs Zoltan partitioner
Zoltan partitioner impact: solver elapsed time is around 15% faster than with the private partitioner
Global elapsed time is around 20% faster with Zoltan

Total time (seconds), 96 cores
Private, intra-chassis    11951
Private, inter-chassis    12117
Zoltan, inter-chassis     9869

Solver time (seconds), 96 cores
Private, inter-chassis    11533
Zoltan, inter-chassis     9829
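The "around 15%" and "around 20%" figures can be reproduced approximately from the 96-core inter-chassis timings. The sketch below interprets "faster" as the relative reduction in elapsed time; that interpretation, and the helper name `reduction_percent`, are assumptions of this example.

```python
def reduction_percent(baseline_s: float, new_s: float) -> float:
    """Return the reduction in elapsed time relative to the baseline, in percent."""
    return (baseline_s - new_s) / baseline_s * 100.0


# 96-core inter-chassis timings in seconds, copied from the slide.
total_private, total_zoltan = 12117.0, 9869.0
solver_private, solver_zoltan = 11533.0, 9829.0

total_gain = reduction_percent(total_private, total_zoltan)
solver_gain = reduction_percent(solver_private, solver_zoltan)

print(f"Total elapsed time reduced by {total_gain:.1f}%")
print(f"Solver elapsed time reduced by {solver_gain:.1f}%")
```

With this definition the solver time drops by just under 15% and the total time by between 18% and 19%, in line with the rounded figures on the slide.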
e5-2698v3 vs e5-2690v4 vs e3-1284lv4 – Zoltan partitioner

Elapsed time (seconds)
              64       96       180      256
2698v3        78823    53061    28106    20876
2690v4        57103    39473    22803
e3-1284Lv4    47567    31988    17440    13645
Comparison example: application performance dependent!

                          m710p     xl230a E5-2698v3   xl230a E5-2680v3
Chassis per rack          8         6                  6
Cartridges per chassis    45
Power per rack with HPL   33200     27000              25200              Watts
Cores per rack            1440      1920               1440
Memory per core           8.0       4.0                5.3                GB
NW bandwidth per core     250.0     31.3               41.7               MB/s
Perf per core             1.38      0.90               1.00
Power per core            19.2      14.1               17.5               Watts
Power/perf ratio          13.94     15.65              17.50
Perf per rack             1.38      1.20               1.00
Price                     102%      139%               100%
Price/performance ratio   74.15%    116.24%            100.00%
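Two of the derived rows can be approximated from the other rows. The sketch below assumes that "Perf per rack" is cores per rack times perf per core, normalized to the xl230a E5-2680v3 rack, and that the price/performance ratio is relative price divided by perf per rack. Both formulas are inferred from the numbers and are not stated on the slide. The `Rack` class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    cores: int            # cores per rack (from the table)
    perf_per_core: float  # relative to xl230a E5-2680v3 = 1.00 (from the table)
    price: float          # relative price, baseline = 1.00 (from the table)


racks = [
    Rack("m710p", 1440, 1.38, 1.02),
    Rack("xl230a E5-2698v3", 1920, 0.90, 1.39),
    Rack("xl230a E5-2680v3", 1440, 1.00, 1.00),
]
baseline = racks[-1]


def perf_per_rack(rack: Rack) -> float:
    """Assumed formula: total rack performance relative to the baseline rack."""
    return (rack.cores * rack.perf_per_core) / (baseline.cores * baseline.perf_per_core)


def price_performance(rack: Rack) -> float:
    """Assumed formula: relative price divided by relative rack performance."""
    return rack.price / perf_per_rack(rack)


for rack in racks:
    print(f"{rack.name}: perf per rack {perf_per_rack(rack):.2f}, "
          f"price/performance {price_performance(rack):.2%}")
```

The perf-per-rack values match the table. The price/performance values come out within about half a percentage point of the table, presumably because the slide was computed from unrounded inputs.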
Benefits
• Full binary compatibility with dual-socket x86 servers
• Fast compute cores
• Excellent memory bandwidth
• Shorter time to solution
• Save on license costs => even better TCO
• Parallelize on fewer cores: better scaling
• 8GB of memory per core
• Excellent network bandwidth per core (equivalent to IB FDR on a dual-socket server)
• Integrate with InfiniBand fabric if necessary
• Better isolation (security / fault tolerance): one server has only 4 cores
• Potential extra compute performance with embedded GPU (OpenCL)
• Remote visualization capabilities for free
• Simple = very robust platform
• 3 years base warranty included
HPE ProLiant m510
Typical Workload: All Purpose Compute Workhorse, Big Data, Media Processing, and more!
SoC: Intel® Xeon® D "Broadwell-DE" 2.4GHz, 8 core; 12MB of L3 Cache
Graphics: N/A
Memory: ECC DDR4 SDRAM (2133/2400MHz) (8GB, 16GB, 32GB); four (4) DIMM slots; maximum configuration 128GB (4x32GB)
Network Controller: Mellanox Connect-X3, Dual 10GbE NIC
Onboard Storage: Three (3) M.2 modules: (1) SATA M.2 (2242), 32GB or 64GB; (2) PCIe M.2 (2280), up to 960GB
External Storage: iSCSI with iSER acceleration
Power: Cartridge Max: TBD, Typical: 90W (includes system overhead)
OS: Ubuntu 15.04, Ubuntu 14.04.3 LTS; RHEL 6.7/7.2; SLES 11 SP4, SLES 12; Windows Server 2012/2012 R2/v.NEXT; CentOS 6.7/7.2
November 12, 2015
2H2016
Coming soon…
HPE ProLiant m710x
Workload: Application Delivery, Video Transcoding, Big Data Analytics
SoC: Intel® Xeon® E3-1285L v5, 3.0GHz (3.9 GHz Turbo), 4-core, 128MB eDRAM
Graphics: Integrated Intel Iris Pro GT4e GPU with 72 execution units; iLO Remote Console
Memory: (4) DDR4 SODIMMs (2133/2400MHz) (8GB, 16GB); maximum configuration 64GB (4x16GB)
Network Controller: Mellanox Connect-X3, Dual 10GbE NIC
Onboard Storage: Five (5) M.2 modules: (1) SATA M.2 (2242), 64GB, 120GB or 240GB; (4) NVMe M.2 (22110), up to 960GB
External Storage: iSCSI with iSER acceleration
Power: Up to 99W
OS: Ubuntu, RHEL, SUSE, SLES; Windows Server 2012/2012 R2/V.next, Windows 7/8.1/10; CentOS, XenServer
February 2016
2H2016
Coming soon…
Consider using one of our Moonshot Global Discovery Labs:
Houston, Texas
Grenoble, France
Singapore
• Test and pre-validation of Moonshot systems and end-to-end solutions
• On-site access or secure remote access
Please Have Another Look at Moonshot Reinvented . . .