1
Integrating GPUs into Condor
Timothy Blattner
Marquette University
Milwaukee, WI
April 22, 2009
2
Outline
- Background and Vision
- Graphics Cards
- Condor Approach
- Problems
- Conclusions and Future Work
3
Graphics cards
- Powerful – NVIDIA Tesla C1060
  - 240 massively parallel processing cores
  - 4 GB GDDR3
  - CUDA capable
  - ~993 gigaflops
  - ~$1,300
- Cheap – NVIDIA 9800 GT
  - 112 massively parallel processing cores
  - 512 MB GDDR3
  - CUDA capable
  - ~$120
4
Vision and Focus
- Pool of computers containing graphics cards, managed by Condor
- Provide users the ability to utilize graphics cards identified by Condor

[Figure: Central Manager and pool machines, each marked "?"]
5
Opportunities
- Resources may already be there
  - Majority of machines have graphics cards in them
  - GPU resources sit idle while Condor runs on the CPU
- Similar work
  - GPUGRID.net
    - Distributed computing project using NVIDIA graphics cards for full-atom molecular simulations of proteins
    - Uses GPU-enabled BOINC client
6
Prototype Implementation
- Linux only
- Script queries operating system and graphics card
- Hawkeye Cron job manager runs script
- Script outputs graphics card information in ClassAd format
- Binary for NVIDIA cards provides more specific information
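
The slides do not show how the script queries the graphics card. One possible approach, sketched here under the assumption that the script reads `lspci` output, is to pick out the VGA controller lines. The sample text, `find_graphics_cards`, and `is_nvidia` are hypothetical names made up for this illustration, not part of the prototype.

```python
import re

# Assumed sample of `lspci` output; real output varies by machine and driver.
SAMPLE_LSPCI = """\
00:1f.2 SATA controller: Intel Corporation 82801IR SATA Controller
01:00.0 VGA compatible controller: NVIDIA Corporation G92 [GeForce 9800 GT] (rev a2)
"""


def find_graphics_cards(lspci_text):
    """Return the description of each VGA controller listed in lspci output."""
    cards = []
    for line in lspci_text.splitlines():
        # Each lspci line has the shape "<slot> <class>: <description>".
        match = re.match(r"^\S+\s+VGA compatible controller:\s*(.+)$", line)
        if match:
            cards.append(match.group(1).strip())
    return cards


def is_nvidia(description):
    """Crude vendor check, used only to decide whether to run an NVIDIA-specific query."""
    return "nvidia" in description.lower()


if __name__ == "__main__":
    for card in find_graphics_cards(SAMPLE_LSPCI):
        print(card, "(NVIDIA)" if is_nvidia(card) else "")
```

On a real machine the sample string would be replaced by the captured output of the `lspci` command.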
7
Graphics Card Architecture
8
Graphics card APIs
- Favor general purpose computations
  - CUDA (NVIDIA)
  - Brook (ATI)
  - OpenCL (Khronos Group)
9
CUDA Programming Model
- Kernels are functions run on the device (GPU)
- Host (CPU) code invokes kernels and determines
  - Number of threads
  - Thread block structure for organizing threads
- Kernel invocations are asynchronous
  - Control returns to the CPU immediately
  - CUDA provides synchronization primitives
  - Some CUDA calls (e.g. memory allocation) are synchronous
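
To make "number of threads" and "thread block structure" concrete, the following is a small sketch of the arithmetic host code commonly performs before a one-dimensional kernel launch. It is plain Python, not CUDA, and the function names and the default of 256 threads per block are assumptions chosen for illustration. The 512-thread cap matches the `Gpu0ThreadsPerBlock` value reported later in the deck.

```python
def launch_config(n_elements, threads_per_block=256, max_threads_per_block=512):
    """Return (blocks, threads_per_block) so that every element gets one thread."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    if not 0 < threads_per_block <= max_threads_per_block:
        raise ValueError("threads_per_block exceeds the device limit")
    # Ceiling division: the last block may be only partly used.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Element index a thread handles in a 1-D launch (blockIdx * blockDim + threadIdx)."""
    return block_idx * block_dim + thread_idx


if __name__ == "__main__":
    blocks, threads = launch_config(1000)
    print(blocks, threads)
    # Threads whose global index is >= 1000 must skip their work inside the kernel.
    print(global_index(blocks - 1, threads, threads - 1))
```

Because the block count is rounded up, a kernel launched this way needs a bounds check so that the surplus threads in the last block do nothing.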
10
Hawkeye Cron Job Manager
- Provides mechanism for collecting, storing, and using information about computers
- Periodically executes specified program(s)
- Program outputs in form of ClassAd
- Outputs are added to machine's ClassAd
11
Hawkeye Implementation
- Added to local configuration file
- Runs script every minute
- Condor user must be granted graphics card privileges in order to query the card

    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
    STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
    STARTD_CRON_UPDATEGPU_PERIOD = 1m
    STARTD_CRON_UPDATEGPU_MODE = Periodic
    STARTD_CRON_UPDATEGPU_KILL = True
12
Script Output

    HasGpu = True
    NGpu = 1
    Gpu0 = "Quadro FX 3700"
    Gpu0CudaCapable = True
    Gpu0_Major = 1
    Gpu0_Minor = 1
    Gpu0Mem = 536150016
    Gpu0Procs = 14
    Gpu0Cores = 112
    Gpu0ShareMem = 16384
    Gpu0ThreadsPerBlock = 512
    Gpu0ClockRate = 1.24
    HasCuda = True
    -
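
The output above is a list of `Name = Value` lines closed by a single `-`. As a rough sketch of how a script might produce that shape, the helper below formats a dictionary of already-gathered values. `to_classad_lines` is a hypothetical name, the quoting rule is simplified, and the prototype's actual `gpu.sh` is not shown in the slides.

```python
def to_classad_lines(attrs):
    """Format a dict as 'Name = Value' lines followed by a closing '-' line."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            # Check bool before numbers: in Python, bool is a subclass of int.
            text = "True" if value else "False"
        elif isinstance(value, str):
            # Simplified quoting (assumption): wrap in double quotes, escape embedded quotes.
            text = '"' + value.replace('"', '\\"') + '"'
        else:
            text = str(value)
        lines.append(f"{name} = {text}")
    lines.append("-")  # mirrors the trailing "-" in the output shown above
    return lines


if __name__ == "__main__":
    gpu_info = {
        "HasGpu": True,
        "NGpu": 1,
        "Gpu0": "Quadro FX 3700",
        "Gpu0CudaCapable": True,
        "Gpu0Cores": 112,
        "Gpu0ClockRate": 1.24,
    }
    print("\n".join(to_classad_lines(gpu_info)))
```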
13
Job Submission
- Users can submit jobs with GPU requirements into Condor
- Portable across Linux distros

    Universe = vanilla
    Executable = tests/CudaJob
    Initialdir = gpuJobs
    Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
    Log = gpu_test.log
    Error = gpu_test.stderr
    Output = gpu_test.stdout
    Queue

    condor_submit gpu_job.submit
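
The `Requirements` expression is what restricts the job to machines that advertise a CUDA-capable card. The sketch below imitates that one expression against machine ads held as Python dictionaries. It is an illustration of the matching idea only, not Condor's ClassAd evaluator, and `matches_gpu_requirements` is a made-up name.

```python
def matches_gpu_requirements(machine_ad):
    """Imitate (HasGpu == true) && (Gpu0CudaCapable == true) for a dict-based ad.

    A missing attribute is treated as "no match", so machines that never ran
    the GPU script are skipped.
    """
    return (machine_ad.get("HasGpu") is True
            and machine_ad.get("Gpu0CudaCapable") is True)


if __name__ == "__main__":
    pool = {
        "gpu-node": {"HasGpu": True, "Gpu0CudaCapable": True},
        "old-card": {"HasGpu": True, "Gpu0CudaCapable": False},
        "cpu-only": {},
    }
    for name, ad in pool.items():
        print(name, matches_gpu_requirements(ad))
```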
14
Access Control
- /dev/nvidiactl and /dev/nvidia* devices must be readable and writable by the submitting/running user
- Could be
  - Nobody, open access
  - Controlled by Unix group, containing limited users
  - Integrated more directly with Condor user control, slot users
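
Whichever policy is chosen, a quick check of the device files can confirm that the account running the job has the access described above. The following is a minimal sketch, assuming the standard `/dev/nvidia*` device names from the slide. `inaccessible_devices` is a hypothetical helper, not part of the prototype.

```python
import glob
import os


def inaccessible_devices(pattern="/dev/nvidia*"):
    """Return matching device paths the current user cannot both read and write."""
    return [
        path
        for path in sorted(glob.glob(pattern))
        if not os.access(path, os.R_OK | os.W_OK)
    ]


if __name__ == "__main__":
    if not glob.glob("/dev/nvidia*"):
        print("no NVIDIA device files found")
    else:
        blocked = inaccessible_devices()
        if blocked:
            print("no read/write access to:", ", ".join(blocked))
        else:
            print("all NVIDIA device files are readable and writable")
```

The default pattern also covers `/dev/nvidiactl`. Running the check as the account Condor uses for jobs tells you whether the group or open-access setup took effect.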
15
Problems
- Preemption
  - Jobs running in GPU kernel cannot be interrupted reliably by Unix signals
- Watchdog timer
  - After 5 seconds, job is killed
  - A solution: use general purpose graphics card as secondary display
- Memory security
  - Malicious users, interrupting a job between GPU kernel calls, have the opportunity to overwrite or copy GPU memory
16
Summary
- Condor-based approach for advertising GPU resources
- Linux-based prototype implementation
- Can access available GPUs
  - Works best on dedicated machines, with no need for preemption
- Current limitations
  - Doesn't report GPU usage
  - Lack of preemption
  - Limited OS and video card support
17
Future Work
- Create benchmark and testing suite
- Handle preemption
  - Investigate how watchdog works
- GPU usage reporting
- Integrate memory protection
- Support more operating systems
  - Windows and Mac OS X
- Support alternative architectures and APIs
  - Brook and OpenCL
18
Questions?
Contact: [email protected]
https://sourceforge.net/projects/condorgpu/