

Advanced Topics in Distributed Systems

Monday, Wednesday, and Friday from 1:30-2:20

RHPH 162 (note change)

Chris Clifton

Email:

Design and control of distributed computing systems (operating systems and database systems). Topics include principles of naming and location, atomicity, resource sharing, concurrency control and other synchronization, deadlock detection and avoidance, security, distributed data access and control, integration of operating systems and computer networks, distributed systems design, consistency control, and fault tolerance.

A more detailed course description prepared for the CEE program is available, as is a course preview briefing containing more detailed information on requirements and expectations. The course outline is given below.

More course information may be available in WebCT (direct link).

Please add yourself to the course mailing list. Send mail to [email protected] containing the line:

add your email to cs603

Feel free to send things to the course mailing list if you feel it is appropriate. An example might be a pointer to a particularly helpful on-line manual describing an API used in one of the projects.

Course Methodology

The course will be taught through lectures, with class participation expected and encouraged. There will be frequent reading assignments to supplement the lectures.

For now, Professor Clifton will not have regular office hours. Feel free to drop by anytime, or send email with some suggested times to schedule an appointment. You can also try H.323/T.120 desktop videoconferencing (e.g., SunForum, Microsoft NetMeeting). To do so, try opening an H.323 connection to blitz.cs.purdue.edu, and send email if there is no response.

Prerequisites

The official requirement is CS 503 (Operating systems), with CS 542 (Distributed Database systems) recommended. The practical requirement is a solid undergraduate background in computer science, including some database and operating systems theory, and substantial programming experience. If you don't have 503 but feel you have sufficient background, please send me an explanation of why you feel you are prepared, along with a phone number and times when I can call to discuss approving your registration.

Text

The following is recommended (it will be a useful reference for much of the lab work in the course):

Internetworking with TCP/IP Vol. III: Client-Server Programming and Applications, D. E. Comer and D. Stevens, Prentice Hall (choose appropriate version for your favorite platform), 0-13-032071-4

The following have been recommended in the past, and may provide useful background reading. However, none are required.

Distributed Systems, 1993, Sape Mullender, Prentice Hall, 0-201-62427-3

Distributed Algorithms, 1997, Nancy Lynch, Morgan Kaufmann, 1-55860-348-4

Distributed Operating Systems, 1995, Tanenbaum, Prentice Hall, 0-13-219908-4

Evaluation/Grading:

Evaluation will be a subjective process; however, it will be based primarily on your understanding of the material as evidenced in:

• Midterm Exam (25%)
• Final Exam (35%)
• Projects (4-5) (40%)

Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted. (If everyone has a notebook with wireless connection and all agree they want to use them in the exams, I could relax this.)

I will evaluate projects on a five point scale:

5: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare.

4: What I'd expect of a Ph.D. candidate. This corresponds to an A grade.

3: Good enough for a Master's degree, but not what I'd like to see for a Ph.D. candidate. This corresponds to a B grade.

2: Okay for a Master's candidate who does extremely well in other courses. This corresponds to a C grade.

1: Not good enough for a graduate student. But something.

0: Missing work, or so bad that you needn't have bothered.

Projects

A substantial portion of your education in this course will come through performing programming projects: building components of a distributed system. Some examples of what projects might involve are:

• Building a server capable of handling multiple simultaneous TCP/IP connections using the Socket API. The server's logic would be trivial (e.g., calculate the square of the input and return the result after a five second delay); the key effort would be working with the API.
• Implementing an application that connects to a (provided) CORBA server.
• Implementing a clock synchronization protocol.

My current expectation is that all projects will be done individually, as it is probable that some of the CEE students will not be collocated with other students in the course.
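To give a feel for the first example project, here is a minimal sketch of a server that squares its input after a delay while serving several connections at once. It is an illustration only, not the assignment's required design: the names `SquareServer`, `SquareHandler`, and `square_via_server`, the one-integer-per-line protocol, and the choice of Python are all assumptions. It uses the standard `socketserver` module (one thread per connection) on top of the socket API rather than calling `accept` and `select` directly.

```python
import socket
import socketserver
import time

DELAY_SECONDS = 5.0  # the delay suggested in the project description


class SquareHandler(socketserver.StreamRequestHandler):
    """Reads one integer per line and replies with its square."""

    def handle(self):
        # rfile yields lines until the client closes its side of the connection.
        for raw in self.rfile:
            text = raw.decode("ascii", errors="replace").strip()
            if not text:
                continue
            try:
                value = int(text)
            except ValueError:
                self.wfile.write(b"error: expected an integer\n")
                continue
            # Simulate slow work; other connections keep being served meanwhile.
            time.sleep(self.server.delay)
            self.wfile.write(f"{value * value}\n".encode("ascii"))


class SquareServer(socketserver.ThreadingTCPServer):
    """Thread-per-connection TCP server (hypothetical name for this sketch)."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address, delay=DELAY_SECONDS):
        super().__init__(address, SquareHandler)
        self.delay = delay


def square_via_server(address, value):
    """Small client helper: send one integer, return the server's reply."""
    with socket.create_connection(address) as conn:
        conn.sendall(f"{value}\n".encode("ascii"))
        with conn.makefile("r", encoding="ascii") as stream:
            return int(stream.readline())
```

To try it, create `SquareServer(("127.0.0.1", 0))`, run its `serve_forever()` method in a background thread, and call `square_via_server(server.server_address, 7)` from several client threads. Because each connection gets its own thread, the delayed replies overlap instead of queuing behind one another.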

Policy on Intellectual Honesty

Please read the above link to the policy written by Professor Spafford. This will be followed unless I provide written documentation of exceptions.

Late work will be penalized except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to the fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet).

The penalty for late work is 1 point (of the possible 5) if turned in after the deadline, and one additional point for each week late.

Syllabus (numbers correspond to week):

Project start/due dates are tentative!

1. Course overview, Components of a distributed system

2. Communication Mechanisms

• Message Passing
• Stream-oriented communications
• Remote Procedure Call
• Remote Method Invocation

3. Remote Method Invocation: Mechanisms

• DCE RPC (reading)
• Java RMI (reading)
• SOAP (Reading: SOAP 1.1 spec, XML Protocol Working Group, Apache SOAP)

First project starts January 23.

4. Naming

• Overview
• X.500/LDAP
• Active Directory (reading)

First project design due January 30.

5. Clock Synchronization

• What is clock synchronization?
Leslie Lamport, "Time, clocks, and the ordering of events in a distributed system", Communications of the ACM 21(7) (July 1978).

• Possibility and impossibility
Lundelius, J. and Lynch, N., "An Upper and Lower Bound for Clock Synchronization", Information and Control, Vol. 62, Nos. 2/3, pp. 190-204, 1984.
Danny Dolev, Joe Halpern, and H. Raymond Strong, "On the possibility and impossibility of achieving clock synchronization", Journal of Computer and System Sciences 32(3) 230-250, April 1986.
Michael J. Fischer, Nancy A. Lynch, and Michael Merritt, "Easy impossibility proofs for distributed consensus problems", Proceedings of the fourth annual symposium on Principles of distributed computing, 1985, Minaki, Ontario, Canada.

• Practical solution: NTP (Reading)

Other Reading:
Leslie Lamport and P. M. Melliar-Smith, "Synchronizing clocks in the presence of faults", Journal of the ACM 32(1) (January 1985).
Jennifer Lundelius and Nancy Lynch, "A new fault-tolerant algorithm for clock synchronization", Proceedings of the third annual ACM symposium on Principles of distributed computing, 1984, Vancouver, British Columbia, Canada.
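As a small illustration of the logical clocks introduced in the Lamport 1978 reading for this week, the sketch below shows the usual textbook formulation of the update rules: increment on every local event, and on receipt jump past the timestamp carried by the message. The class and method names are assumptions made for this example and are not taken from the course materials.

```python
class LamportClock:
    """Logical clock for one process (hypothetical helper for illustration)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new time."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is itself an event; the returned value is the message timestamp."""
        return self.tick()

    def receive(self, message_time):
        """On receipt, move past both the local time and the message timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

With these rules, if event a happens before event b (in the same process, or through a chain of messages), the timestamp of a is smaller than that of b. The converse does not hold: a smaller timestamp alone does not imply a causal relationship.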

First project due February 11.

6. Process Synchronization

• Overview: Global State, Mutual Exclusion
Leslie Lamport, "The Mutual Exclusion Problem", Journal of the ACM 33(2) (April 1986). Read Part II section 2 - the rest is optional.
Leslie Lamport, "1983 Invited address: Solved problems, unsolved problems and non-problems in concurrency", Proceedings of the third annual ACM symposium on Principles of distributed computing, 1984, Vancouver, British Columbia, Canada.
Optional - Global State: K. Mani Chandy and Leslie Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Transactions on Computer Systems 3(1) (February 1985) 63-75.

• Fault Tolerant Solutions
Michael J. Fischer, Nancy A. Lynch, James E. Burns and Allan Borodin, "Distributed FIFO allocation of identical resources using small shared space", ACM Transactions on Programming Languages and Systems 11(1) (1989) pp. 90-114.

• Multiple resources Requirements
Please don't check these out - others may want to read them.
Dijkstra, E., "Hierarchical Ordering of Sequential Processes", Acta Informatica 1 (1971), 115-138.
M. Rabin and D. Lehmann, "On the Advantages of Free Choice: A Symmetric and Fully Distributed Solution to the Dining Philosophers Problem", Proceedings of the 8th Symposium on Principles of Programming Languages (1981) pp. 133-138.

Second project starts February 15.

7. Distributed Transactions

• 2-Phase Commit
• Formal Models for failure and recovery
• 3-Phase Commit

Reading:
Skeen, Dale, "A Formal Model of Crash Recovery in a Distributed System", IEEE Transactions on Software Engineering 9(3), May 1983, pp. 219-228. (preliminary on-line version from SIGMOD'81)
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Concurrency Control and Recovery in Database Systems, Chapter 7: Distributed Recovery, Addison Wesley, 1987.

8. Distributed Data: Replication

• Basics

Reading: Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Concurrency Control and Recovery in Database Systems, Chapter 8: Replicated Data, Addison Wesley, 1987.

• Example: Replication in Oracle
• Advanced Techniques: Quasi-Copies

Reading: Rafael Alonso, Daniel Barbará, and Hector Garcia-Molina, "Data caching issues in an information retrieval system", ACM Transactions on Database Systems (TODS) 15(3), September 1990.

Second project due March 1.

9. Mid-Semester Review

March 8, in class: Midterm on material from weeks 1-7.

10. Processes, code migration

Third project starts March 20.

• Threads vs. Processes, Code migration basics
• Mobile Agents
• Mobile Agents example: D'Agents

Reading: D'Agents web site, position paper.

11. Distributed Object Systems: CORBA (OMG)

Reading: CORBA Overview from The Common Object Request Broker: Architecture and Specification, OMG group, 2001. CORBA Security Service (reading).

Third project due April 3, fourth project starts.

12. Distributed Object Systems

• DCOM

Reading: DCOM vs. .NET

• .NET

13. Distributed Coordination: Jini

Further reading: Jan Newmarch's Guide to JINI Technologies.

14. Fault Tolerance

• Failure models
Reading: Dr. Flaviu Cristian, Understanding Fault-Tolerant Distributed Systems, Communications of the ACM 34(2), February 1991.

• Fault Tolerance
Reading: Felix C. Gärtner, Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments, ACM Computing Surveys 31(1), March 1999.

• Reliable communication
• Recovery

Optional reading:
Richard Golding and Elizabeth Borowsky, Fault-Tolerant Replication Management in Large-Scale Distributed Storage Systems, in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, 18-21 October, 1999, Lausanne, Switzerland.
Hector Garcia-Molina, Christos A. Polyzois and Robert B. Hagmann, Two Epoch Algorithms for Disaster Recovery, in Proceedings of the 1990 conference on Very Large Data Bases, Brisbane, Australia, August 13-16, 1990.

Fourth project due April 19.

15. Review

Final exam Thursday, May 2, 2002 from 1:00pm to 3:00pm in RHPH 164.