Distributed computing at the Facility level: applications and attitudes
Tom Griffin, STFC ISIS Facility
tom.griffin@stfc.ac.uk
NOBUGS 2008, Sydney
Spare cycles
• Typical PC CPU usage is about 10%
• Usage minimal 5pm – 8am
• Most desktop PCs are really fast
• Waste of energy
• How can we use (“steal?”) unused CPU cycles to solve computational problems?
Types of Application
• CPU intensive
• Low to moderate memory use
• Not too much file output
• Coarse grained
• Command line / batch driven
• Licensing issues?
Distributed computing solutions
Lots of choice: CONDOR, GridEngine, GridMP…
• Grid MP Server hardware
  • Two, dual Xeon 2.8GHz servers, RAID 10
• Software
  • Servers run RedHat Linux Enterprise Server / DB2
  • Unlimited Windows (and other) clients
• Programming
  • Web Services interface – XML, SOAP
  • Accessed with C++, Java, C#
• Management Console
  • Web browser based
  • Can manage services, jobs, devices etc.
• Large industrial user base
  • GSK, J&J, Novartis etc.
Installing and Running Grid MP
Server Installation: 2 hours
Client Installation: 30 seconds
• Create MSI and RPM using ‘setmsiprop’
Manual Install
• Better security on Linux and Macs
Adapting a program for GridMP
1) Think about how to split your data
2) Wrap your executable
3) Write the application service
  • Pre and post processing
  • Fairly easy to write
  • Interface to grid via Web Services
  • C++, Java, C#
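
Step 1 is often the simplest part when a job consists of many independent runs. The following is a minimal sketch of one way to divide such runs into chunks, one chunk per workunit. `WorkSplitter` and `Split` are hypothetical names used for illustration only; they are not part of the Grid MP API, and no Grid MP calls are made.

```csharp
using System;
using System.Collections.Generic;

public static class WorkSplitter
{
    // Divide totalRuns independent runs into at most maxUnits contiguous chunks.
    // Each tuple is (index of first run, number of runs); sizes differ by at most one.
    public static List<(int First, int Count)> Split(int totalRuns, int maxUnits)
    {
        if (totalRuns < 0) throw new ArgumentOutOfRangeException(nameof(totalRuns));
        if (maxUnits < 1) throw new ArgumentOutOfRangeException(nameof(maxUnits));

        var chunks = new List<(int First, int Count)>();
        int units = Math.Min(totalRuns, maxUnits);
        int first = 0;
        for (int i = 0; i < units; i++)
        {
            // The first (totalRuns % units) chunks take one extra run.
            int count = totalRuns / units + (i < totalRuns % units ? 1 : 0);
            chunks.Add((first, count));
            first += count;
        }
        return chunks;
    }

    public static void Main()
    {
        // Example: 64 runs spread over 20 workunits.
        foreach (var (first, count) in Split(64, 20))
            Console.WriteLine($"runs {first}..{first + count - 1}");
    }
}
```

Keeping chunks coarse (several runs per workunit) suits the coarse-grained applications described earlier, since each workunit carries a fixed transfer and start-up cost.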
Package your executable
PROGRAM MODULE EXECUTABLE
• Uploaded to, and resident on, the server
• Contains: executable, DLLs, standard data files, environment variables
• Compress? Encrypt?
Create / run a job
[Diagram. Client side: packages Pkg1 to Pkg4 in the datasets “Molecules” and “Proteins”; create job, generate cross product of the datasets to give workunits. Server side, reached over https://: start job.]
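
The cross product on this slide pairs every item in one dataset with every item in the other, and each pair becomes one workunit. A minimal sketch of that idea in plain C# follows. The class name, method name and sample item names are hypothetical, and the real server-side generation in Grid MP may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CrossProduct
{
    // Pair every molecule with every protein; each pair stands for one workunit.
    public static List<(string Molecule, string Protein)> Workunits(
        IEnumerable<string> molecules, IEnumerable<string> proteins)
    {
        return (from m in molecules
                from p in proteins
                select (Molecule: m, Protein: p)).ToList();
    }

    public static void Main()
    {
        // Placeholder item names, not taken from the talk.
        var units = Workunits(new[] { "mol-A", "mol-B" },
                              new[] { "prot-X", "prot-Y", "prot-Z" });
        foreach (var u in units)
            Console.WriteLine($"{u.Molecule} x {u.Protein}");
    }
}
```

The number of workunits is the product of the dataset sizes, so two modest datasets can produce a large job.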
Code examples
// Create the job
Mgsi.Job job = new Mgsi.Job();
job.application_gid = app.application_gid;
job.description = txtJobName.Text.Trim();
job.state_id = 1;
job.job_gid = ud.createJob(auth, job);

// Describe the job step (the call that registers it and fills in
// js.job_step_gid is not shown on the slide)
Mgsi.JobStep js = new Mgsi.JobStep();
js.job_gid = job.job_gid;
js.state_id = 1;
js.max_concurrent = 1;
js.max_errors = 20;
js.num_results = 1;
js.program_gid = prog.program_gid;

// Create a data set belonging to the job
Mgsi.DataSet ds = new Mgsi.DataSet();
ds.job_gid = job.job_gid;
ds.data_set_name = job.description + "_ds_" + DateTime.Now.Ticks;
ds.data_set_gid = ud.createDataSet(auth, ds);

// Upload one data file per workunit (datas and options are declared elsewhere)
for (int i = 1; i <= numWorkunits.Value; i++) {
    FileTransfer.UploadData uploadD = ft.uploadFile(auth, Application.StartupPath + "\\testdata.tar");
    Mgsi.Data data = new Mgsi.Data();
    data.data_set_gid = ds.data_set_gid;
    data.index = i;
    data.file_hash = uploadD.hash;
    data.file_size = long.Parse(uploadD.size);
    datas[i - 1] = data;
}
ud.createDatas(auth, datas);

// Generate the workunits from the data set
ud.createWorkunitsFromDataSetsAsync(auth, js.job_step_gid, new string[] { ds.data_set_gid }, options);
Performance
• Famotidine form B
• 13 degrees of freedom
• P21/c, V=1421
• Sync data to 1.64 Å
• 1 x 10^7 moves per run, 64 runs

Standard DASH
• 2.4GHz Core2 Quad, using single core
• Job complete = 9 hrs

Gdash submit to test grid of 5 in-use PCs
• 4 x 2.4GHz Core2 Quad
• 1 x 2.8GHz Core2 Quad
• Job complete = 24 minutes

Speedup = 22.5x
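
The speedup quoted here is the single-core elapsed time divided by the grid elapsed time. The short sketch below reproduces that arithmetic and also divides by the 20 cores of the test grid (five quad-core PCs). `SpeedupCalc` is a hypothetical helper name, and the per-core figure is only indicative because the test PCs were in use and not identically clocked.

```csharp
using System;

public static class SpeedupCalc
{
    // Speedup = serial elapsed time / parallel elapsed time.
    public static double Of(TimeSpan serial, TimeSpan parallel) =>
        serial.TotalMinutes / parallel.TotalMinutes;

    public static void Main()
    {
        double s = Of(TimeSpan.FromHours(9), TimeSpan.FromMinutes(24));
        int cores = 5 * 4; // five quad-core PCs on the test grid
        Console.WriteLine($"speedup = {s:F1}x, speedup per core = {s / cores:F3}");
    }
}
```

With the figures from the slide, 540 minutes divided by 24 minutes gives 22.5, slightly more than the number of cores. The slides do not explain the excess.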
Performance – 999 SA runs, full grid
[Plot: workunits against time]
317 cores from 163 devices
• 42 Athlons: 1.6–2.2GHz
• 168 Core 2 Duos: 1.8–3GHz
• 36 Core 2 Quads: 2.4–2.8GHz
• 1 Duron @ 1.2GHz
• 42 P4s: 2.4–3.6GHz
• 27 Xeons: 2.5–3.6GHz
4 days 18 hours CPU in ~40 minutes elapsed time
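
Dividing total CPU time by elapsed time gives the average number of cores that were busy during the job. The sketch below works this out from the figures on the slide. `GridLoad` is a hypothetical name, and because the elapsed time is quoted only as "~40 minutes" the result is approximate.

```csharp
using System;

public static class GridLoad
{
    // Average number of busy cores = total CPU time / elapsed wall-clock time.
    public static double AverageBusyCores(TimeSpan cpuTime, TimeSpan elapsed) =>
        cpuTime.TotalMinutes / elapsed.TotalMinutes;

    public static void Main()
    {
        var cpuTime = new TimeSpan(4, 18, 0, 0);   // 4 days 18 hours, from the slide
        var elapsed = TimeSpan.FromMinutes(40);    // "~40 minutes", so only approximate
        double busy = AverageBusyCores(cpuTime, elapsed);
        Console.WriteLine($"about {busy:F0} of 317 cores busy on average");
    }
}
```

On these numbers, roughly 171 cores were busy on average, a little over half of the 317 available. That is plausible for a grid of desktop PCs that were still in use.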
A Particular Success - McStas
HRPD supermirror guide design
Complex design
Meaningful simulations take a long time
Want to try lots of ideas
Many runs of >200 CPU days
Simpler model was best value
Massive improvement in flux
Significant cost savings
Problems
McStas
Interactions in the wild
Symantec Anti-Virus
Did not show up in testing
McStas restricted to night running only
User Attitudes
A range
Theft
“I’m not having that on my machine”
First thing to get blamed
Gaining more trust
Evangelism by users
Flexibility with virtualisation
Request to run ‘GARefl’ code
ISIS is Windows based
Few Linux PCs
VMWare server is freeware
8 Hosts gave 26 cores
More cores = more demand
56 real cores recruited from servers, 64-core Beowulf
10 Mac cores
Run Linux as a job
The Future
Grid growing in power every day
• New machines added, old ones still left on
Electricity
• Energy saving drive at STFC – switch machines off
• Wake-on-LAN ‘Magic Packets’ + remote hibernate
Laptops
• Good or bad?
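
A Wake-on-LAN magic packet is six 0xFF bytes followed by the target's MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below builds and sends one using standard .NET classes. The class and method names are hypothetical, port 9 is a common convention and not a requirement, and the talk does not say how the ISIS grid would send its packets.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;

public static class WakeOnLan
{
    // Build the 102-byte magic packet: 6 x 0xFF, then the MAC repeated 16 times.
    public static byte[] BuildMagicPacket(string mac)
    {
        byte[] macBytes = mac.Split(':', '-')
                             .Select(part => Convert.ToByte(part, 16))
                             .ToArray();
        if (macBytes.Length != 6)
            throw new ArgumentException("MAC address must have 6 octets", nameof(mac));

        var packet = new byte[6 + 16 * 6];
        for (int i = 0; i < 6; i++) packet[i] = 0xFF;
        for (int rep = 0; rep < 16; rep++)
            Buffer.BlockCopy(macBytes, 0, packet, 6 + rep * 6, 6);
        return packet;
    }

    // Send the packet as a UDP broadcast on the local network.
    public static void Send(string mac, int port = 9)
    {
        byte[] packet = BuildMagicPacket(mac);
        using (var client = new UdpClient())
        {
            client.EnableBroadcast = true;
            client.Send(packet, packet.Length, new IPEndPoint(IPAddress.Broadcast, port));
        }
    }

    public static void Main()
    {
        // Placeholder MAC address; builds the packet without sending it.
        Console.WriteLine(BuildMagicPacket("00:11:22:33:44:55").Length);
    }
}
```

The target PC must have Wake-on-LAN enabled in its firmware and network adapter settings, and broadcasts normally do not cross subnets without extra network configuration.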
Summary
Distributed computing: perfect for coarse-grained, CPU-intensive, ‘disk-lite’ applications.
Resources: use existing resources. Power increases with time, no need to write off assets. Scalable.
Not just faster: allows one to try different scenarios.
Virtualisation: Linux under Windows, Windows under Linux.
Green credentials: PCs are running anyway, better to utilise them. Can be powered down and up.
Acknowledgements
ISIS Data Analysis Group
• Kenneth Shankland
• Damian Flannery
STFC FBU IT Service Desk and ISIS Computing Group
Key Users
• Richard Ibberson (HRPD)
• Stephen Holt (GARefl)
Questions?