INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org gLExec, SCAS and the paths forward Introduction to pilot jobs and gLExec and SCAS framework David Groep Nikhef release 8


INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

gLExec, SCAS and the paths forward

Introduction to pilot jobs and gLExec and SCAS framework

David Groep, Nikhef

release 8


Outline

• Late Binding and the Distribution of Access Control
• Distributing site access control in-depth using gLExec
• gLExec deployment scenarios
• Coordinating Site Access Control with SCAS

gLExec, SCAS, and the road towards distributed access control


Jobs: from early to late binding

User submits his jobs to a resource through a ‘cloud’ of intermediaries

Direct binding of payload and submitted grid job
• job contains all the user’s business
• access control is done at the site’s edge
• inside the site, the user job has a specific, site-local, system identity


Binding Late

user’s system for job management

job container binds to actual workload

Late binding of workload using ‘pilot jobs’
• generic job containers are sent, which can verify the ‘surroundings’
• retrieve payload from a repository ‘elsewhere’
• if the repository is run by the user, on a per-user basis, then it is likely that it’s the user’s payload – if communication is secure


Multi-User Pilot Jobs

What if the user ‘outsources’ the running of the pilot jobs?
• then whoever runs the pilot jobs will run workload for multiple users
• but the site only grants access to the ‘service provider’ (VO) …


Impact of late binding on sites and credentials

At the site itself, what does a user job look like?


Pushing access control downwards


Classic model


Pushing access control downwards


Multi-user pilot jobs hiding in the classic model


MUPJ security issues

When multiple users share a common pilot job deployment, they will, by design, use the same account at the site

• Accountability
  it is no longer clear at the site who is responsible for activity

• Integrity
  a compromise of any user using the MUPJ framework ‘compromises’ the entire framework
  the framework can’t protect itself against such compromise unless you allow change of system uid/gid

• Site access control policies are ignored

• … and several more …


Pushing access control downwards


Making multi-user pilot jobs explicit with distributed Site Access Control (SAC)

- on a cooperative basis -


Implementing distributed SAC

Component 1: gLExec

a thin layer to change Unix domain credentials based on grid identity and attribute information

you can think of it as:
• ‘a replacement for the gatekeeper’
• ‘a griddy version of Apache’s suexec’
• ‘a program wrapper around LCAS, LCMAPS or GUMS’


Pilot Jobs and gLExec

On success: gLExec sets the uid/gid to those of the new user and executes that user’s job

On failure: gLExec returns with an error, and the pilot job can terminate or obtain another user’s job
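
The success/failure behaviour above can be sketched from the pilot's point of view. This is a minimal illustration, not gLExec's interface: the `launch` callable stands in for "hand the payload to gLExec", and `run_next_payload`, `fake_launch` and the example DNs are hypothetical names introduced here. How a real pilot tells a gLExec refusal apart from a failing payload is deliberately left out.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# (owner identity, command line) as fetched from the VO's task queue
Payload = Tuple[str, Sequence[str]]


def run_next_payload(
    payloads: Iterable[Payload],
    launch: Callable[[str, Sequence[str]], bool],
) -> Optional[str]:
    """Try payloads in order and return the owner of the first accepted one.

    `launch` is a stand-in for the identity-switching wrapper: it returns
    True if the site accepted the owner and the payload was executed, and
    False if the request was refused.
    """
    for owner, command in payloads:
        if launch(owner, command):
            return owner
        # Refused: the pilot may simply move on to another user's job.
    return None  # nothing was accepted, so the pilot can terminate


def fake_launch(owner: str, command: Sequence[str]) -> bool:
    """Hypothetical launcher that refuses a single banned identity."""
    return owner != "/DC=org/DC=example/CN=Banned User"


queue = [
    ("/DC=org/DC=example/CN=Banned User", ["/bin/true"]),
    ("/DC=org/DC=example/CN=Good User", ["/bin/true"]),
]
accepted_owner = run_next_payload(queue, fake_launch)
nothing_accepted = run_next_payload(queue[:1], fake_launch)
```

Passing the launcher in as a parameter keeps the retry logic testable without a gLExec installation.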


gLExec deployment modes

• Identity Mapping Mode – ‘just like on the CE’
  – have the VO query (and by policy honour) all site policies
  – actually change uid based on the true user’s grid identity
  – enforce per-user isolation and auditing using uids and gids
  – requires gLExec to have setuid capability

• Non-Privileged (‘Logging Only’) Mode – declare only
  – have the VO query (and by policy honour) all site policies
  – do not actually change uid: no isolation or auditing per user
  – the gLExec invocation will be logged, with the user identity
  – does not require setuid powers – job keeps running in pilot space

• ‘Empty Shell’ – do nothing but execute the command…


Identity change

Let’s assume you make it setuid. Fine. Where to map to:

• To a shared set of common pool accounts
  – Uid and gid mapping on CE corresponds to the WN
  – Requires SCAS or shared state (gridmapdir) directory
  – Clear view on who-does-what

• To a per-WN set of pool accounts
  – No site-wide configuration needed
  – Only limited (and generic) set of pool uids on the WN
  – Need only as many pool accounts as you have job slots
  – Makes cleanup easier, ‘local’ to the node

• Or something in between ... e.g. one pool for the CE, another for the WNs

But if it is not setuid, it cannot isolate & protect the pilot.


gLExec: gluing grid computing to the Unix world – CHEP 2007

But all pieces should go together

1. glexec on the worker-node deployment

2. way to keep the pilot jobs submitters to their word
  – mainly: monitor for compromised pilot submitters credentials
  – system-level auditing of the pilot jobs, but auditing data on the WN is useful for incident investigations only

3. ‘internal accounting should be done by the VO’
  – the regular site accounting mechanisms are via the batch system, and these will see the pilot job identity
  – the site can easily show from those logs the usage by the pilot job
  – making a site do accounting based on glexec jobs is non-standard, and requires non-trivial effort


Batch system and OS compatibility

How does gLExec affect the basic functions of a batch system?

1. Job Submission

2. Job Suspend/Resume

3. Job Kill

4. CPU time accounting
  – No change with respect to current behaviour of jobs
  – Times are accumulated on wait and collated with the gLExec usage

by keeping the process tree, gLExec is transparent for the tested batch systems

tests based on work by Ulrich Schwickerath
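
The remark that times are ‘accumulated on wait’ refers to a general Unix mechanism: when a parent waits for a child, the child's CPU time (and that of the child's own waited-for descendants) is added to the parent's ‘children’ usage counters. The sketch below demonstrates only that mechanism using Python's standard `resource` module on a Unix system; it does not involve gLExec or a batch system, and `children_cpu_seconds` is a helper name introduced here.

```python
import resource
import subprocess
import sys


def children_cpu_seconds() -> float:
    """CPU time (user + system) of all children this process has waited for."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime


before = children_cpu_seconds()

# Start a child that burns some CPU; subprocess.run() waits for it to exit,
# and that wait is the point at which its usage is charged to the parent.
subprocess.run(
    [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"],
    check=True,
)

after = children_cpu_seconds()
print(f"CPU seconds collected from the child: {after - before:.3f}")
```

As long as every process in the chain waits for its children, the usage propagates up the process tree, which is why keeping the tree intact matters for batch-system accounting.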


gLExec: where are we now?

You can deploy without changes if
• you run LSF or Torque and don’t manage disk or processes
• you run LSF or Torque and use TMPDIR and process-tree based style job slaughtering

You should update your scripts to use the back-mapping dir if
• you use LSF or Torque and use uid recognition for pruning stray processes (but you ought to change this anyway)
• you use uid recognition for file cleaning


What Happens to Access Control?

So, as the workload binding gets pushed deeper into the site, access control by the site has to become layered as well …

… how does that affect site access control software and its deployment?


Site Access Control today

PRO
• already deployed
• no need for external components, amenable to MPI

CON
• when used for MU pilot jobs, all jobs run with a single identity
• end-user payload can back-compromise pilots, and cross-infect other jobs
• incidents impact large community (everyone utilizing the MUPJ framework)


Node-local access control

PRO
• no single points of failure
• well defined number of pool accounts (as many as there are job slots/node)
• containment of jobs (no cross-WN infection)

CON
• need to distribute the policy through fabric management/config tools
• no cross-workernode mapping (e.g. no support for pilot-launched MPI)


WN-coordinated access control

PRO
• single unique account mapping per user across whole farm, CE, and SE
• transactions database is simple (implemented as an NFS file system)
• communications protocol is well tested and well known

CON
• need to distribute the policy through fabric management/config tools
• coordination only applies to the account mapping, not to authorization
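
A file system can serve as the ‘transactions database’ because creating a hard link is atomic. The sketch below illustrates the idea in the style of a gridmapdir: one empty file per pool account, and a lease is a second name (derived from the user's DN) hard-linked to that file. It is a simplified illustration under stated assumptions: the function name `lease_pool_account`, the URL-encoded lower-case file naming, and the example DNs are choices made here, not a description of the deployed LCMAPS code.

```python
import os
import tempfile
from typing import Optional
from urllib.parse import quote


def lease_pool_account(gridmapdir: str, dn: str, prefix: str) -> Optional[str]:
    """Return the pool account leased to `dn`, taking a free one if needed."""
    key = os.path.join(gridmapdir, quote(dn, safe="").lower())
    accounts = sorted(n for n in os.listdir(gridmapdir) if n.startswith(prefix))

    if os.path.exists(key):
        # Existing lease: find the account file that shares the key's inode.
        key_inode = os.stat(key).st_ino
        for name in accounts:
            if os.stat(os.path.join(gridmapdir, name)).st_ino == key_inode:
                return name
        return None

    for name in accounts:
        path = os.path.join(gridmapdir, name)
        if os.stat(path).st_nlink != 1:
            continue  # already leased to some other DN
        try:
            os.link(path, key)  # atomic: this is the whole "transaction"
        except FileExistsError:
            # Another process leased an account for this DN in the meantime.
            return lease_pool_account(gridmapdir, dn, prefix)
        if os.stat(path).st_nlink == 2:
            return name
        os.unlink(key)  # lost a race for this account; try the next one
    return None  # pool exhausted


with tempfile.TemporaryDirectory() as demo_dir:
    for account in ("pool001", "pool002"):
        open(os.path.join(demo_dir, account), "w").close()
    first = lease_pool_account(demo_dir, "/DC=org/DC=example/CN=Alice", "pool")
    second = lease_pool_account(demo_dir, "/DC=org/DC=example/CN=Bob", "pool")
    again = lease_pool_account(demo_dir, "/DC=org/DC=example/CN=Alice", "pool")
    exhausted = lease_pool_account(demo_dir, "/DC=org/DC=example/CN=Carol", "pool")
```

Because every node that mounts the same directory sees the same links, the same user ends up on the same account everywhere, which is the coordination property listed under PRO.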


Site-central access control

PRO
• single unique account mapping per user across whole farm, CE, and SE*
• can do instant banning and access control in a single place
• protocol profile allows interop between SCAS and GUMS (but no others!)

CON
• replicated setup for redundancy needed for H/A sites
• still cannot do credential validation (formalistic issues with the protocol)

* of course, central policy and distributed per-WN mapping also possible!


Centralizing decentralized SAC

Supporting consistent
• policy management
• mappings (if they are not WN-local)
• banning

via the

Site Central Authorization Service (SCAS)
– network wrapper around LCAS and LCMAPS
– it’s a variant-SAML2XACML2 client-server
– it is itself access controlled


Local LCMAPS


• Linked dynamically or statically to application
• does both credential acquisition
  - local grid map file
  - VOMS FQAN to uid and gids
• and enforcement
  - setuid
  - krb5 token requests
  - AFS tokens
  - LDAP directory update

LCAS is similar in use and design, but makes the basic Yes/No decision
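
The split between the Yes/No decision and the credential acquisition step can be sketched as two small functions. This is a toy model and not the LCAS or LCMAPS plug-in API: `parse_gridmap`, `lcas_like_decision`, `lcmaps_like_mapping`, the group table and all DNs, FQANs and account names are hypothetical, and enforcement (the actual setuid) is not shown.

```python
import shlex
from typing import Dict, List, Optional, Sequence, Set, Tuple


def parse_gridmap(text: str) -> Dict[str, str]:
    """Parse grid-mapfile style lines of the form: "<quoted DN>" <account>."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)
        mapping[dn] = account
    return mapping


def lcas_like_decision(dn: str, banned: Set[str]) -> bool:
    """The basic Yes/No step: here simply 'not on the ban list'."""
    return dn not in banned


def lcmaps_like_mapping(
    dn: str,
    fqans: Sequence[str],
    gridmap: Dict[str, str],
    fqan_groups: Dict[str, str],
) -> Optional[Tuple[str, List[str]]]:
    """Acquisition step: DN gives the account, VOMS FQANs give the groups."""
    account = gridmap.get(dn)
    if account is None:
        return None
    groups = [fqan_groups[f] for f in fqans if f in fqan_groups]
    return account, groups


GRIDMAP = parse_gridmap(
    """
    # hypothetical entries
    "/DC=org/DC=example/CN=Alice Example" alice
    "/DC=org/DC=example/CN=Bob Example" bob
    """
)
FQAN_GROUPS = {"/examplevo": "examplevo", "/examplevo/Role=pilot": "vopilot"}
BANNED = {"/DC=org/DC=example/CN=Bob Example"}

alice_dn = "/DC=org/DC=example/CN=Alice Example"
alice_allowed = lcas_like_decision(alice_dn, BANNED)
alice_mapping = lcmaps_like_mapping(
    alice_dn, ["/examplevo/Role=pilot", "/examplevo"], GRIDMAP, FQAN_GROUPS
)
bob_allowed = lcas_like_decision("/DC=org/DC=example/CN=Bob Example", BANNED)
```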


SCAS: LCMAPS in the distance


• Application links LCMAPS dynamically or statically, or includes Prima client
• Local side talks to SCAS using a variant-SAML2XACML2 protocol
  - with agreed attribute names and obligation between EGEE/OSG
  - remote service does acquisition and mappings
  - both local, VOMS FQAN to uid and gids, etc.
• Local LCMAPS (or application like gLExec) does the enforcement


Talking to SCAS

• From the CE
  – Connect to the SCAS using the CE host credential
  – Provide the attributes & credentials of the service requester, the action (“submit job”) and target resource (CE) to SCAS
  – Using common (EGEE+OSG+GT) attributes
  – Get back: yes/no decision and uid/gid/sgid obligations

• From the WN with gLExec
  – Connect to SCAS using the credentials of the pilot job submitter
    An extra control to verify the invoker of gLExec is indeed an authorized pilot runner
  – Provide the attributes & credentials of the service requester, the action (“run job now”) and target resource (CE) to SCAS
  – Get back: yes/no decision and uid/gid/sgid obligations

• The obligations are now coordinated between CE and WNs
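
The two exchanges above share one shape: an authenticated caller sends subject attributes, an action and a target resource, and gets back a decision plus uid/gid/sgid obligations. The sketch below models only that shape with plain Python objects. It is not the SAML2-XACML2 wire format or the SCAS implementation, and every name in it (`AuthzRequest`, `Decision`, `ToyCentralService`, the caller names, DNs and numeric ids) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class AuthzRequest:
    caller: str      # who connects: the CE host or the pilot job submitter
    subject_dn: str  # the service requester the decision is about
    action: str      # e.g. "submit job" or "run job now"
    resource: str    # target resource, e.g. the CE


@dataclass(frozen=True)
class Decision:
    permit: bool
    uid: Optional[int] = None
    gid: Optional[int] = None
    sgids: Tuple[int, ...] = ()


class ToyCentralService:
    """One place that holds the caller list, the ban list and the mappings."""

    def __init__(self, trusted_callers: Set[str],
                 accounts: Dict[str, Tuple[int, int, Tuple[int, ...]]]):
        self.trusted_callers = trusted_callers
        self.accounts = accounts
        self.banned: Set[str] = set()

    def decide(self, request: AuthzRequest) -> Decision:
        if request.caller not in self.trusted_callers:
            return Decision(False)  # the service is itself access controlled
        if request.subject_dn in self.banned:
            return Decision(False)  # central banning takes effect at once
        account = self.accounts.get(request.subject_dn)
        if account is None:
            return Decision(False)
        uid, gid, sgids = account
        return Decision(True, uid, gid, sgids)


user = "/DC=org/DC=example/CN=Alice Example"
service = ToyCentralService(
    trusted_callers={"ce.example.org", "/DC=org/DC=example/CN=VO Pilot Submitter"},
    accounts={user: (40001, 4000, (4001,))},
)
from_ce = service.decide(AuthzRequest("ce.example.org", user, "submit job", "CE"))
from_wn = service.decide(AuthzRequest(
    "/DC=org/DC=example/CN=VO Pilot Submitter", user, "run job now", "CE"))
from_stranger = service.decide(AuthzRequest("unknown-host", user, "run job now", "CE"))
service.banned.add(user)
after_ban = service.decide(AuthzRequest("ce.example.org", user, "submit job", "CE"))
```

Because both the CE and the WN ask the same service, they receive identical obligations for the same user, and a ban entered once applies to both paths.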


Where does SCAS go?

SCAS is the medium-term answer to distributed access control
– Going to central certification now
– Testing by SA3/AMS shows well over 25 Hz performance (speed was limited only by available number of client nodes, where bandwidth is limited by running in virtual machines)
– ‘bonus’ features (like central credential validation) may be added on demand – ask if you want this

Long-term solution is part of the new Authorization Framework
• new Execution Environment Service (EES) will
  • take care of the account mapping &c,
  • using technology elements from SCAS
  • and leveraging the other AuthZ components for policy administration, coordinated policy decisions and enforcement


Questions?
