
Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006

Prof. Douglas Thain
CSE Department
9 Feb 2007

What is Condor?

Condor is software from UW-Madison that harnesses idle cycles and storage from existing machines. (ND workstations are 89% idle!)

With the assistance of OIT/CSE staff, Condor has been installed on 379 CPUs in the Colleges of Engineering and Science since early 2005.

Our Condor pool is expanding the capabilities of researchers in CSE, EE, AME, and Physics to perform CPU- and storage-intensive research.

More users and contributors are welcome to join!

http://www.nd.edu/~condor


Computing Environment

[Diagram: machines, each with a CPU and a disk, grouped into the Fitzpatrick Workstation Cluster, the CCL Research Cluster, the CVRL Research Cluster, and Miscellaneous CSE Workstations. Queued jobs pass through the Condor MatchMaker to reach the machines.]

Example owner policies shown in the diagram:
– "I will only run jobs when there is no-one working at the keyboard."
– "I will only run jobs between midnight and 8 AM."
– "I prefer to run a job submitted by a CSE student."
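The owner statements in this diagram can be read as two kinds of rule: hard requirements (when a machine will accept any job at all) and soft preferences (which job it would rather run). The following is a minimal Python sketch of that idea only. It is not Condor's actual matchmaking language or algorithm, and every name in it (`Job`, `Machine`, `MachineState`, `match`, the machine names) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    owner_dept: str  # hypothetical attribute, e.g. "CSE"


@dataclass
class MachineState:
    hour: int            # local hour of day, 0-23
    keyboard_idle: bool  # True when no one is working at the keyboard


@dataclass
class Machine:
    name: str
    state: MachineState
    # Hard constraint: will this machine accept the job right now?
    requirements: Callable[[Job, MachineState], bool]
    # Soft preference: a higher value means the owner prefers this job.
    rank: Callable[[Job], int] = lambda job: 0


def match(job: Job, machines: list[Machine]) -> Optional[Machine]:
    """Return the willing machine that ranks the job highest, or None."""
    willing = [m for m in machines if m.requirements(job, m.state)]
    if not willing:
        return None
    return max(willing, key=lambda m: m.rank(job))


# The three owner policies from the diagram, evaluated at 2 PM.
afternoon = MachineState(hour=14, keyboard_idle=True)
machines = [
    Machine("desk", afternoon,
            requirements=lambda job, s: s.keyboard_idle),
    Machine("night", afternoon,
            requirements=lambda job, s: 0 <= s.hour < 8),
    Machine("cse", afternoon,
            requirements=lambda job, s: True,
            rank=lambda job: 1 if job.owner_dept == "CSE" else 0),
]

chosen = match(Job(owner_dept="CSE"), machines)
print(chosen.name if chosen else "no match")
```

In this sketch the "night" machine refuses everything at 2 PM, and the "cse" machine wins for a CSE job because its rank is higher. A job from another department falls back to the first willing machine.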

Scheduling Policy

First, Owners Exercise Absolute Control
– Set who can use the machine, for what, and when.
– Can kick jobs off at any time manually.
– Default policy:
    Start job if console idle > 15 minutes.
    Suspend job if console used or CPU busy.
    Kick off job if suspended > 10 minutes.

After satisfying that principle, the users split available CPU hours equally.

It is a little more complicated; see details here:
http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
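The default policy above amounts to a small decision rule. Here is a minimal Python sketch of it, using the 15-minute and 10-minute thresholds from the slide. The state labels, the `decide` function, and the choice to resume a suspended job once the owner goes away are assumptions made for illustration; the real policy is expressed in Condor configuration, not in Python.

```python
from enum import Enum


class Action(Enum):
    START = "start"
    SUSPEND = "suspend"
    KICK_OFF = "kick off"
    CONTINUE = "continue"  # keep running, or resume after a suspension
    WAIT = "wait"


START_IDLE_MIN = 15   # from the slide: start if console idle > 15 minutes
MAX_SUSPEND_MIN = 10  # from the slide: kick off if suspended > 10 minutes


def decide(job_state, console_idle_min, cpu_busy, suspended_min=0.0):
    """Pick an action for one machine.

    job_state is "none", "running", or "suspended" (hypothetical labels).
    "Console used" is modeled as console_idle_min == 0 (an assumption).
    """
    owner_active = console_idle_min == 0 or cpu_busy
    if job_state == "none":
        if console_idle_min > START_IDLE_MIN:
            return Action.START
        return Action.WAIT
    if job_state == "running":
        return Action.SUSPEND if owner_active else Action.CONTINUE
    if job_state == "suspended":
        if suspended_min > MAX_SUSPEND_MIN:
            return Action.KICK_OFF
        # Assumption: resume when the owner is no longer active.
        return Action.WAIT if owner_active else Action.CONTINUE
    raise ValueError(f"unknown job state: {job_state!r}")


print(decide("none", console_idle_min=20, cpu_busy=False))
print(decide("running", console_idle_min=0, cpu_busy=False))
print(decide("suspended", console_idle_min=0, cpu_busy=False, suspended_min=11))
```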

[Charts: CPU History and Storage History]

Current Donors Feb 2007

Owner           Nodes  CPUs  Storage (TB)
CRC/OIT            92    92           3.7
CSE                73   124          11.7
Prof. Thain        59    91           5.5
Prof. Flynn        18    35          0.65
Prof. Striegel     10    20          0.65
Misc                7    17
Total             259   379          20.2

Flocking Between Universities

Notre Dame: 379 CPUs
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Purdue B: 1016 CPUs

http://www.cse.nd.edu/~ccl/operations/condor/

Total Consumption in 2005

http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html

CPU-Hours Harnessed by Condor (56%) 642912

CPU-Hours Totally Unused (31%) 350148

CPU-Hours Consumed by Owner at Keyboard (11%) 134978

CPU-Hours Total (100%) 1128038

Top Condor Users in 2005

Advisor    Dept  Max Jobs in Queue  Max Jobs Running  Percent of Total  CPU Hours  User
???        ???                  92                40             0.22%       1390  [email protected]
Izaguirre  CSE                  32                32             0.28%       1820  [email protected]
???        ???                  52                52             0.64%       4116  [email protected]
Chawla     CSE                1814               145             0.87%       5619  [email protected]
Izaguirre  CSE                 120               112             0.96%       6148  [email protected]
Striegel   CSE                 800                24             1.15%       7371  [email protected]
Striegel   CSE                 688                28             1.56%      10016  [email protected]
Fuja       EE                   85                78             2.06%      13236  [email protected]
Kogge      CSE                 100               100             3.49%      22425  [email protected]
-          CRC                   7                 7             4.23%      27184  [email protected]
Flynn      CSE                2004               163             7.94%      51058  [email protected]
Kogge      CSE                 740               204            85.27%     548187  [email protected]
Total                         2268               327           100.00%     642912

http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html

Total Consumption in 2006

CPU-Hours Harnessed by Condor (48%) 1161176

CPU-Hours Totally Unused (39%) 934277

CPU-Hours Consumed by Owner at Keyboard (11%) 281003

CPU-Hours Total (100%) 2376456

http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html
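As a quick check on the two "Total Consumption" slides, the shares can be recomputed from the raw CPU-hour counts. The sketch below uses only the figures printed on the slides; the dictionary layout and the `shares` helper are illustrative. In both years the three categories add up exactly to the stated total, and the percentages on the slides look truncated rather than rounded (for example, 2005 harnessed hours come to just under 57%, shown as 56%).

```python
# CPU-hour figures copied from the 2005 and 2006 "Total Consumption" slides.
usage = {
    2005: {"harnessed": 642912, "unused": 350148, "owner": 134978,
           "total": 1128038},
    2006: {"harnessed": 1161176, "unused": 934277, "owner": 281003,
           "total": 2376456},
}


def shares(year_data):
    """Return each category as a percentage of the year's total."""
    total = year_data["total"]
    return {name: 100.0 * hours / total
            for name, hours in year_data.items() if name != "total"}


for year, data in usage.items():
    parts = data["harnessed"] + data["unused"] + data["owner"]
    print(year, "categories sum to total:", parts == data["total"])
    print(year, {name: f"{pct:.1f}%" for name, pct in shares(data).items()})

# Year-over-year growth in CPU-hours harnessed by Condor.
growth = usage[2006]["harnessed"] / usage[2005]["harnessed"]
print(f"harnessed CPU-hours grew by a factor of {growth:.2f}")
```

By this calculation the pool harnessed roughly 1.8 times as many CPU-hours in 2006 as in 2005, even though the harnessed share of the total fell.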

Top Condor Users in 2006

Advisor    Dept  Max Jobs in Queue  Max Jobs Running  Percent of Total  CPU Hours  User
Izaguirre  CSE                  72                30             0.04%        413  [email protected]
Thain      CSE                  35                29             0.08%        980  [email protected]
Renaud     AME                   6                 6             0.15%       1690  [email protected]
Fuja       EE                   66                53             0.24%       2741  [email protected]
Striegel   CSE                 366                24             0.42%       4842  [email protected]
Costello   EE                  192               160             1.78%      20708  [email protected]
Thain      CSE                 300               299             2.00%      23217  [email protected]
Izaguirre  CSE                1082                64             2.38%      27693  [email protected]
-          CRC                  21                21             2.70%      31341  [email protected]
Kogge      CSE                1275               213            16.02%     186050  [email protected]
Flynn      CSE               20030               447            35.82%     415972  [email protected]
Chawla     CSE               60314              1142            40.57%     471126  [email protected]
Total                        61695              1156           100.00%    1161176

http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html

(More than 389 jobs can be running at once because some jobs migrate to Purdue and UW.)

Research Projects Using Condor

Data Mining and Applications
– CSE: Chawla

Multidimensional Biometric Imaging and Applications (NSF/DOJ)
– CSE: Flynn and Bowyer

High End Biometric Computing (NSF)
– CSE: Thain and Flynn

Architectures and Devices for Quantum Dot Cellular Automata (NSF)
– EE and CSE: Kogge, Lent, Fay, Orlov

GEMS Grid Enabled Molecular Simulations (NSF)
– CSE and Chem: Izaguirre, Striegel, Peng

Delay-Constrained Multihop Transmission in Wireless Networks: Interaction of Coding, Channel Access, and Routing (NSF/NASA/Moto)
– EE: Laneman, Costello, Fuja, Haenggi

ND Design Automation Laboratory
– AME: Renaud

GRAND: Gamma Ray Astrophysics at Notre Dame
– Physics: Poirer (Distributed Storage)

Recent Papers Supported by Cycles from Condor at ND (1)

CSE: Data Mining

N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under review.

D. Cieslak, N. Chawla, "The Calibration and Power of Probability Estimation Trees in Ensembles," 7th International Workshop on Multiclassifier Systems, under review.

D. Cieslak, N. Chawla, "Reducing Loss and Improving ROC AUC Through Sampling," International Conference on Machine Learning, Corvallis, Oregon, 2007.

N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," Proceedings of the AAAI Workshop on the Evaluation Methods in Machine Learning, Boston, July 2006.

D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," Hot Topics Sessions: 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), Paris, France, June 2006.

D. Cieslak, N. Chawla, A. Striegel, "Combating Imbalance in Network Intrusion Datasets," IEEE International Conference on Granular Computing, Athens, Georgia, May 2006.

Recent Papers Supported by Cycles from Condor at ND (2)

CSE: Biometrics

X. Chen, T. Faltemier, P. Flynn, and K. Bowyer, "Human Face Modeling and Recognition Through Multi-View High Resolution Stereopsis", Biometrics: Theory, Applications, and Systems, 2006.

D. Woodard, T. Faltemier, P. Yan, and P. Flynn, "A Comparison of 3D Biometric Modalities", Biometrics: Theory, Applications, and Systems, 2006.

T. Faltemier, P. Flynn, and K. Bowyer, "3D Face Recognition with Curvature Based Region Selection", 3D Data Processing, Visualization, and Transmission, 2006.

T. Faltemier, K. Bowyer, and P. Flynn, "Region Ensemble for 3D Face Recognition and Indexing", under submission.

T. Faltemier, K. Bowyer, and P. Flynn, "Using Multiple Gallery Images for 3D Face Recognition", under submission.

EE/CSE: Quantum Comp

Timothy J. Dysart. "Defect Properties and Design Tools for Quantum Dot Cellular Automata." Master's Thesis, 2005.

Timothy J. Dysart, Peter M. Kogge, Craig S. Lent, and Mo Liu. "An Analysis of Missing Cell Defects in Quantum-Dot Cellular Automata." IEEE International Workshop on Design and Test of Defect-Tolerant Nanoscale Architectures (NANOARCH '05) in conjunction with the VLSI Test Symposium. Palm Springs, CA. May 1, 2005.

Recent Papers Supported by Cycles from Condor at ND (3)

EE: Signal Coding

On Deriving Good LDPC Convolutional Codes, A. E. Pusane, R. Smarandache, P. O. Vontobel, and D. J. Costello, Jr., submitted to IEEE International Symposium on Information Theory, Nice, France, June 2007.

A Comparison of ARA- and Protograph-Based LDPC Block and Convolutional Codes, D. J. Costello, Jr., A. E. Pusane, C. Jones, and D. Divsalar, to appear in Proc. Information Theory and Applications Workshop, San Diego, CA, USA, January 29-February 2, 2007.

LDPC Convolutional Codes: What Are They? How Do They Work? Are They Any Good?, D. J. Costello, Jr. and A. E. Pusane, in Book of Abstracts, AMS Joint Mathematics Meetings, New Orleans, LA, USA, January 5-8, 2007.

L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Algebraic Superposition of LDGM Codes for Cooperative Diversity," submitted to IEEE International Symposium on Information Theory (ISIT) 2007.

L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Cooperative diversity based on code superposition," in IEEE International Symposium on Information Theory (ISIT), Seattle, WA, July 2006.

L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Nested codes with multiple interpretations," in 40th Conference on Information Sciences and Systems (CISS), Princeton, NJ, March 2006.

Recent Papers Supported by Cycles from Condor at ND (4)

CSE: Network Simulation

Yingxin Jiang, Aaron Striegel, "A Distributed Traffic Control Scheme based on Edge-Centric Resource Management," ACM Computer Communications Review, vol. 36, no. 2, pp. 5-16, April 2006.

Effects of low-quality computation time estimates in policed schedulers, Justin M. Wozniak, Yingxin Jiang and Aaron Striegel, Proc. Annual Simulation Symposium, IEEE Computer Society, 2007.

D. Salyers, A. Striegel, "A Novel Approach for Transparent Bandwidth Conservation," Proceedings of Networking 2005, Waterloo Ontario Canada, May 2005.

CSE: Scientific Databases

Access Control for a Replica Management Database, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, Jesus Izaguirre, ACM Workshop on Storage Security and Survivability (StorageSS), October 2006.

Generosity and Gluttony in GEMS: Grid Enabled Molecular Simulations, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, and Jesus Izaguirre, in Proceedings of the IEEE Symposium on High Performance Distributed Computing, July 2005.

Recent Papers Supported by Cycles from Condor at ND (5)

CSE: Grid Computing

Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid, Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn, Workshop on Large Scale and Volatile Desktop Grids, March 2006.

Operating System Support for Space Allocation in Grid Storage Systems, Douglas Thain, IEEE Conference on Grid Computing, September 2006.

The Consequences of Decentralized Security in a Cooperative Storage System, Douglas Thain, Chris Moretti, Paul Madrid, Phil Snowberger, and Jeff Hemmes, IEEE Workshop on Security in Storage (SISW), San Francisco, December 2005.

Separating Abstractions from Resources in a Tactical Storage System, Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel, and Jesus Izaguirre, in Proceedings of IEEE/ACM Supercomputing, Nov 2005.

Patisserie: Support for Parameter Sweeps in a Fault-Tolerant, Massively Parallel, Peer-to-Peer Simulation Environment, Timothy Schoenharl, Scott Christley, and Douglas Thain, Workshop on Agent Directed Simulation (ADS), San Diego, California, April 2005.

How does Condor relate to CRC?

Use the CRC clusters for:
– CPU-intensive, fine-grained parallel codes.
– The latest, fastest machines.
– Professional, continuous support.

Use the Condor pool for:
– Coarse grained, naturally parallel codes.
– Harnessing college/dept level machines.
– Integration with distributed storage.
– Building and deploying novel systems for computer science research.
– Self-service support at this point.

(Some ambitious students use both!)

How does Condor relate to OSG?

The Open Science Grid
– A wide-area consortium of universities.
– A mechanism (Condor+Globus) to access remote batch/storage systems over the WAN.
– Interface (Condor-G) is one piece of Condor.

The ND Condor Pool
– A campus-scale collection of resources.
– Could be made accessible via OSG interface.
– Indirectly part of OSG/TeraGrid via Purdue.

Example of Application/CS Research Using Condor

Scalable I/O for Biometrics

Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up clever new matching function. Compare them.

How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.

Credit: Patrick Flynn at Notre Dame CSE
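The test procedure above is an all-pairs computation. A minimal Python sketch follows, assuming F is symmetric so that only the upper triangle (including the diagonal) needs to be filled. The `toy_match` function and the tiny integer "images" are placeholders invented for illustration; they are not the lab's matching function or data.

```python
from itertools import combinations_with_replacement


def similarity_matrix(images, F):
    """Fill the upper triangle of the matrix F(S_i, S_j) for i <= j.

    Entries below the diagonal stay None; they would repeat the upper
    triangle if F is symmetric (an assumption of this sketch).
    """
    n = len(images)
    matrix = [[None] * n for _ in range(n)]
    for i, j in combinations_with_replacement(range(n), 2):
        matrix[i][j] = F(images[i], images[j])
    return matrix


def toy_match(a, b):
    """Placeholder matcher: fraction of positions where two images agree."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


# Three tiny stand-in "images" of four pixels each.
images = [[1, 2, 3, 4], [1, 2, 0, 0], [9, 9, 9, 9]]
result = similarity_matrix(images, toy_match)
for row in result:
    print(row)
```

Each matrix entry is independent of the others, which is what makes the workload a natural fit for many loosely coupled jobs.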

Computing Similarities

Result matrix of F (upper triangle):

1   .8  .1  0   0   .1
    1   0   .1  .1  0
        1   0   .1  .7
            1   0   0
                1   .1
                    1

A Big Data Problem

Data Size: 10k images of 1 MB = 10 GB

Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB

Would like to repeat many times!

In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
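The arithmetic behind these two figures can be spelled out directly. This sketch assumes decimal units (1 MB = 10^6 bytes) and that each comparison reads both of its 1 MB images in full, which is how the slide's "2 MB" factor is interpreted here.

```python
MB = 10**6
GB = 10**9
TB = 10**12

num_images = 10_000
image_size = 1 * MB

# Storing every image once.
data_size = num_images * image_size

# Comparing every pair once: about n * n / 2 comparisons (the slide's 1/2),
# each reading two images.
comparisons = num_images * num_images // 2
bytes_per_comparison = 2 * image_size
total_io = comparisons * bytes_per_comparison

print(data_size / GB, "GB")  # 10.0 GB
print(total_io / TB, "TB")   # 100.0 TB
```

The data set itself is small; it is the quadratic number of reads that pushes total I/O four orders of magnitude higher.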

Conventional Solution

[Diagram: eight machines, each with a CPU and a disk, running jobs alongside two central disks.]

Move 200 TB at Runtime!

Using Tactical Storage

[Diagram: eight machines, each with a CPU and a disk, running jobs.]

1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.

Result: achieve greater than 2 Gb/s of disk-to-application bandwidth on a large workload.
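The three steps can be sketched in a few lines of Python. The slides do not say how chunks are placed or how a job picks a copy, so the round-robin placement and the "prefer a copy on my own host" rule below are assumptions, and all function and disk names are hypothetical.

```python
def chunk(items, chunk_size):
    """Step 1: break the data set into fixed-size chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def replicate(num_chunks, disks, copies):
    """Step 2: place each chunk on `copies` distinct disks (round-robin)."""
    if copies > len(disks):
        raise ValueError("cannot place more copies than there are disks")
    return {
        c: [disks[(c + k) % len(disks)] for k in range(copies)]
        for c in range(num_chunks)
    }


def pick_replica(chunk_id, placement, job_host):
    """Step 3: prefer a copy on the job's own host, else the first replica."""
    replicas = placement[chunk_id]
    return job_host if job_host in replicas else replicas[0]


images = list(range(10))          # stand-ins for image files
chunks = chunk(images, 4)         # three chunks: 4, 4, and 2 items
disks = ["d0", "d1", "d2"]        # hypothetical disk/host names
placement = replicate(len(chunks), disks, copies=2)

print(placement)
print(pick_replica(1, placement, job_host="d2"))
```

A job that reads a whole chunk from a local copy, and uses it for every comparison that needs it before discarding it, avoids fetching the same image repeatedly from a central disk.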

Technical Issues (1)

Deployment
– All codes and config in AFS, just deploy startup script in /etc/init.d.
– Manual copy onto each node gets lost at the end of the semester, copy into image.

Firewalls
– TCP/UDP on ports 9000-1000 both directions.
– One firewalled machine can hang everyone!
– Workaround: Periodic check of TCP ports, manually disable Condor on FW nodes.
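The periodic TCP check in the workaround could look something like the following Python sketch. It only tests whether a TCP connection can be opened; UDP is not covered. The host names and the port number in the usage lines are placeholders, and which port a given node actually listens on is left as an assumption.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def blocked_hosts(hosts, port, check=tcp_port_open):
    """List the hosts whose port could not be reached.

    `check` is injectable so the logic can be exercised without a network.
    """
    return [h for h in hosts if not check(h, port)]


# Hypothetical usage: report nodes that look firewalled so that Condor
# can be disabled on them by hand, as the slide describes.
# for host in blocked_hosts(["node1.example.edu", "node2.example.edu"], 9000):
#     print("possible firewall on", host)
```

Run from cron, a check like this would only flag candidates; disabling Condor on the flagged nodes remains the manual step described on the slide.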

Technical Issues (2)

Disappearing Servers
– Problem: condor_master on each host disappears mysteriously; pool decays.
– Diagnosis: AFS outage? Condor bug?
– Solution: /etc/cron.hourly/restart_condor

CPU Detection
– Problem: Hyperthreaded machines appear to be multi-CPU machines on Linux.
– Result: Condor overcommits the CPU.
– Solution: Manual override NUM_CPUS=1

Summary

With your help, our Condor pool has provided significant benefits for both research and education. Thank you!

Liaison between faculty and staff at the dept, college, and univ level is needed to keep the system working.

Lots more info here:
– http://www.nd.edu/~condor
– [email protected]@listserv.nd.edu