Brian Bockelman/UNL OSG Agency Review August 19, 2014

Technology and Software

Meeting research needs through advances, delivery and support.

Overview

✦ The Technology and Software team focuses on providing a production-quality set of software to users and resource providers for DHTC (distributed high-throughput computing) and on evolving the technology used by the OSG in the medium term.
  ★ We focus on integrating technologies from external projects as opposed to developing our own software.
  ★ We work on advancing and refining the science of DHTC (as we have throughout the lifetime of OSG).
✦ Technology is a balancing act between the torrent of new technologies and maintaining older capabilities.
  ★ We also must balance between meeting stakeholder needs and progressing our internal vision.
  ★ By continuing to simplify the existing OSG Software Stack, we will be able to address new challenges in Years 4 and 5 while staying within the same effort profile.
✦ The evolution of the OSG Software Stack helps us keep abreast of the state of the art in DHTC. All technology has a lifecycle in the OSG:
  ★ We investigate potential new - sometimes immature - technologies that we believe will improve DHTC.
  ★ We integrate and release software into the OSG Software Stack.
  ★ We maintain, deprecate, and remove software at the end of its lifecycle.


OSG Technology Area

✦ The investigations team (1.8 FTE): Focuses on identifying and introducing technologies that could be disruptive on the 6-month to 2-year timescale.
✦ The software team (5.8 FTE): Responsible for maintaining the software stack. Ideally the work is mostly maintenance (bug fixes), nightly testing, and integration.
  ★ This is also the effort which performs development as necessary.
✦ The release team (1.5 FTE paid from project funds; 1.9 FTE total): Responsible for the monthly release process - including acceptance testing, documentation, and release management.
  ★ The responsibility for release management was moved to this separate team in 2013.

[Figure label alongside the three teams: "Decreasing time horizon"]

A success in this phase of the OSG has been to integrate these three responsibilities into a single software area.

The Beginning

Beginning of the Software Lifecycle

✦ I’ll now give three examples of new technologies and efforts from the last three years: HTCondor-CE, removing user certificates, and simplifying data management.
✦ The focus is to significantly simplify either the user experience or site operations.
✦ Onboarding new technologies is not a simple process:
  ★ Can it solve our problem? Does it fit into the vision of DHTC? Can we support it? Does it fit with our existing software solution? Will shortcomings be fixed?
✦ Even 8 years into the project, this remains an essential activity - many early technical decisions were made for us by the stakeholders. We must stabilize and lead, changing the decisions to help DHTC succeed for WLCG and the long tail of science.


New Components - HTCondor-CE

✦ The Compute Element (CE) is the set of software which accepts pilot jobs from offsite and submits them to a batch system.
✦ An internal review of our software components indicated our original core technology (GRAM) had a declining support profile, and we were unique in the scale of its use.
  ★ We had the opportunity to consolidate the set of technologies utilized.
✦ We are in the process of deploying a new CE, HTCondor-CE. This is a particular configuration of HTCondor - not a new piece of software.
✦ HTCondor-CE is more scalable, more robust, easier to debug, and thus less costly to operate.


OSG Security - Evolving Trust Relationships

✦ In the old model, users needed a credential (grid certificate) globally valid and understood by all resources. In the new model, only the VO needs such a credential.
✦ As we are able to bring our overlay approach into production, it enables the new trust model.

[Figure: Old Model vs. New Model. Old Model: the resource establishes direct trust with each user. New Model: the resource trusts the VO, and the VO trusts the users.]

Traceability Project

✦ Trust is now recognized as transitive - resources trust the VO; the VO trusts the users; therefore, the resources trust the users.
  ★ A simpler trust model helps the resource provider share resources!
✦ Resources (especially DOE) often have a traceability requirement - they need to demonstrate they know who is using the resource.
  ★ Fulfilling this was easier in the old model, where users established a relationship directly with the resource.
✦ OSG’s traceability project helps establish the trust in the new model. Through technical assistance, audits, and VO reviews, we show resources can fulfill this requirement for certain VOs.
  ★ The first system studied was GlideinWMS; primary beneficiaries have been the OSG and GLOW VOs.
  ★ This enables sites with strict traceability requirements to get out from underneath a user-unfriendly model.
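
The transitive trust argument on this slide can be illustrated with a small reachability check over "directly trusts" edges. This is only a sketch: the `TrustGraph` class and the names `site-a`, `glow-vo`, `alice`, and `bob` are hypothetical and do not correspond to any OSG software or real policy.

```python
class TrustGraph:
    """Toy model of direct trust relationships (all names are hypothetical)."""

    def __init__(self):
        self._edges = {}

    def add_trust(self, truster, trustee):
        """Record that `truster` directly trusts `trustee`."""
        self._edges.setdefault(truster, set()).add(trustee)

    def trusts(self, truster, trustee):
        """Return True if trust follows from a chain of direct trust edges."""
        seen = set()
        stack = [truster]
        while stack:
            node = stack.pop()
            for nxt in self._edges.get(node, ()):
                if nxt == trustee:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False


# Old model: the resource needs one direct relationship per user.
old = TrustGraph()
old.add_trust("site-a", "alice")

# New model: the resource trusts the VO, and the VO vouches for its users.
new = TrustGraph()
new.add_trust("site-a", "glow-vo")
new.add_trust("glow-vo", "alice")
new.add_trust("glow-vo", "bob")

print(old.trusts("site-a", "bob"))  # no direct relationship was set up
print(new.trusts("site-a", "bob"))  # derived through the VO
```

In the new model the resource keeps a single edge (to the VO) no matter how many users the VO adds, which is the simplification the slide describes. The traceability requirement corresponds to being able to recover the chain that justified a given answer.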


The Long Tail of Science: Simplifying Software Management

✦ Software installation has long been a serious problem - originally, we asked sites to provide an NFS mount. VOs were responsible for keeping software installs synchronized across dozens of sites.
✦ The WLCG introduced CVMFS, a read-only global file system which distributes files through a network of HTTP caches.
  ★ OSG identified the potential of the concept and developed/deployed CVMFS as a service for our VOs, branded OASIS. This took close collaboration between Technology and Operations.
  ★ In 2013, VOs could install their software once at the GOC and publish everywhere. Starting in 2014, VOs can manage their own repositories.
  ★ The network of HTTP caches is built on common, widely-adopted HTTP technologies and is reused for other activities.
✦ Software is data - just differing in scale! With the Intensity Frontier, OSG is working to evaluate alternate file distribution methods to see if CVMFS will scale for data management (targeting dataset sizes of 100GB-1TB).
  ★ We believe there is opportunity in treating this data in a uniform way.
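
To see why distributing read-only files through layered HTTP caches scales well, consider a toy read-through cache chain. This is a sketch under stated assumptions, not the CVMFS client or any real proxy API: `Origin`, `ReadThroughCache`, the file path, and the two-tier layout are all hypothetical.

```python
class Origin:
    """Stands in for the authoritative repository server (hypothetical)."""

    def __init__(self, files):
        self._files = dict(files)
        self.requests = 0

    def get(self, path):
        self.requests += 1
        return self._files[path]


class ReadThroughCache:
    """Caches objects fetched from an upstream tier on first use."""

    def __init__(self, upstream):
        self._upstream = upstream
        self._store = {}
        self.requests = 0

    def get(self, path):
        self.requests += 1
        if path not in self._store:
            # Miss: fetch once from upstream. Assumes clients only read,
            # so a cached copy never has to be invalidated in this toy model.
            self._store[path] = self._upstream.get(path)
        return self._store[path]


origin = Origin({"/sw/app-1.0/bin/run": b"binary"})
regional = ReadThroughCache(origin)   # hypothetical regional replica tier
site = ReadThroughCache(regional)     # hypothetical site-level HTTP cache

# 1000 jobs at one site read the same file.
results = [site.get("/sw/app-1.0/bin/run") for _ in range(1000)]

print(site.requests, regional.requests, origin.requests)
```

Only the first read travels upstream; the remaining reads are served at the site. The same property stops helping once the files a job needs no longer fit in the cache, which is the scaling question raised for 100GB-1TB datasets.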


The Middle

Middle of the Software Lifecycle

✦ The middle of the software lifecycle - mature software in production at sites - is the majority of our effort.
✦ The focus is on keeping production quality:
  ★ Release software as part of a cohesive, consistent set of packages.
  ★ Test and verify new software versions. Does it still work in concert with our other software?
  ★ Attempt to get new features to sysadmins in a timely manner.
  ★ Coordinate with the software developers to solve issues at sites and organize feature requests.


The Software Factory

✦ One important service OSG provides is a “software factory” - the software, services, and processes for a coherent software distribution.
  ★ Logically, raw components (software packages) go in one side. A software distribution comes out the other.
  ★ We assemble/integrate the components, improve them, test them, and distribute the results to the OSG Production Grid.
✦ We are expanding this to a “Software Factory Factory”; other organizations, such as HTCondor, HCC, Internet2, and USCMS, are investigating how to use our infrastructure to produce software distributions.
✦ In the early days, many pieces of grid software were of “research project quality” with packaging not meeting our standards. Sustained efforts in packaging have paid off - most software is now available from widely-known community repositories or has packaging maintained by developers.
  ★ A key enabler was the transition from a niche packaging format to OS-native packaging tools. Software projects had no interest in maintaining “OSG packaging” but have more interest in maintaining quality “RedHat packaging”.


Example: HTCondor

✦ The OSG Software team adds 8 patches; for example:
  ★ Patch the startup script to integrate OSG-used security libraries.
  ★ Ensure generated grid proxies are at least 1024 bits.
✦ Integrates it with other software packages, such as:
  ★ Globus GRAM, to add hooks for the OSG environment in grid jobs.
  ★ GlideinWMS, for pilot jobs.
✦ Automated nightly tests include:
  ★ Submit jobs directly to the batch system.
  ★ HTCondor-G job -> GRAM -> HTCondor backend.
✦ Contributed our packaging back to the HTCondor project - they now perform most of the maintenance for us.


Consolidating Technology - HTCondor

✦ During this part of the lifecycle, investments in the technology may bear fruit across several areas.
✦ A key new technology is the ability to submit jobs remotely using SSH instead of a dedicated gatekeeper.
  ★ OSG Technology helped test and validate scale.
  ★ This shares common components (“blahp”) with HTCondor-CE.
  ★ It now forms the core of the campus grids solution.
✦ While HTCondor is an external project, OSG is recognized as a flagship community promoting DHTC. As Miron has mentioned, community efforts are even paying off internationally. With OSG help:
  ★ In 2013, the UK WLCG Tier-1 switched its local batch system to HTCondor.
  ★ In 2014, CERN selected HTCondor as its new site batch system.


Software Releases

✦ The release team is responsible for the final product sysadmins see: they are the final safeguards for quality.
  ★ In 2013, we created this separate team led by the release manager (who, organizationally, is part of Operations, not Technology).
✦ Fixes are verified in environments that are as realistic as possible. If possible, real jobs are used.
  ★ Acceptance testing is a very different environment from the software team’s nightly tests.
✦ We target a monthly software release and have a second monthly date set aside for urgent releases (security issues, critical bugs).
  ★ 35 releases since the beginning of the current grant.
  ★ Our average rate is 4 releases per quarter.
✦ Bug reports are only closed once the release team verifies a fix and performs the corresponding release.
  ★ 383 resolved in 2013; 240 to date in 2014.


The End

The Software Orphanage

✦ When a key piece of software dies - runs out of funding, is abandoned by the developers, goes in an incompatible direction - OSG must shoulder the support costs in order to maintain quality or capability.
  ★ The set of software in this state is known as the “software orphanage”.
  ★ The goal of the software orphanage is to ease the pain of its removal from OSG.
  ★ We will maintain and update the software while we help stakeholders replace it with something else.
✦ This presents a continuous struggle: without retiring old software, we would end up spending all our effort here!


Example: Bestman2

✦ The Bestman2 software implements a protocol called SRM (Storage Resource Management), which is essential for interoperability with the WLCG.
  ★ SRM has proven fairly resilient to attempts to retire it; as it is a niche grid protocol, there are no new implementations.
  ★ Its developers ran out of funding about three years ago.
✦ In 2012 / 2013, we had to make a major investment in this software and its dependencies to support new signing algorithms for SSL.
  ★ While the preference is to retire, this is an example of our ability to maintain software for the stakeholders (who had no capability to do this themselves). This is an essential service.


Outlook

✦ This is a large area - necessarily so, as we must cover technologies through their entire lifecycle.
  ★ Our most visible “output” is an integrated stack of software for accomplishing DHTC.
  ★ Due to our software expertise, there’s significant draw on this area from other areas. Technology intersects networking, security, user support, and campus grids; we often loan effort for specific projects.
  ★ We plan to continue the software evolution through the next two years. We will “draw down” effort on older components and move it into new challenges. We will decrease the number of software components we ship by 25% in the next two years.
✦ The OSG Software Stack is moving to execute our DHTC vision while providing a production platform for existing users.


Questions?

✦ Some additional material is contained in the backup slides; I’ll be happy to answer questions on that material too.


BACKUP SLIDES

Significant Accomplishments

✦ Completed the transition to native packaging formats (RPM).
  ★ Packaging infrastructure is now more friendly to outsiders.
  ★ Pacman retired at end of Year 1.
✦ Re-organization: Consolidated separate software and technology teams.
✦ Re-organization: Created a separate release team with the release manager part of OSG Operations.
✦ Established an (approximately) monthly release cadence: 15 releases in Year 1, 15 releases in Year 2, 4 releases (in three months) in Year 3.
✦ Major software advances:
  ★ “Uneventful” major upgrades of most components.
  ★ Added support for SHA-2 security algorithms.
  ★ Transitioned from Java 6 to 7.
  ★ Upgraded supported version of HDFS to v2.0.
  ★ Release of HTCondor-CE.
  ★ Transition to CVMFS-centric software delivery.


Year 4 - Major Technical Work Items

✦ Add support for RedHat Enterprise Linux 7.
✦ Package PanDA, a major workload management system.
✦ Verify the components of the OSG Software Stack are ready for a 2x increase of scale during Run II.
✦ Improve our ability to allow OSG users to seamlessly utilize XD allocations.
✦ Investigate various mechanisms to improve data delivery to jobs.
✦ Design / implement a new OSG client to better operate in the “overlay world”.
✦ Review technical documentation.
✦ Improve automated testing.
✦ Contribute additional software packaging to community repositories.


Details of Software Release Process

✦ The software team receives a request or notices the need for an update.
✦ Create a ticket in JIRA (ticketing system). This ticket is maintained throughout the process.
✦ Download new source code and/or packaging to local storage. Update packaging as needed (at least, version info).
✦ Build the new package in Koji (build system); it goes into the development repository.
✦ The software team member tests the build for major flaws, and possibly basic functionality.
✦ Promote the package into the testing repository. In a clean virtual machine, the automated testing software installs and tests the integrated software.
  ★ At this point, the package is handed to the release team.
✦ One or more testers from the release team test the new package in real installs, fully integrated and under realistic usage.
✦ Do a final round of pre-release testing that integrates all updates.
✦ Release the package into production as part of the monthly release. The ticket in JIRA is closed.
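
The steps above form a linear pipeline with one hand-off between teams. The sketch below models it as a small state machine. It is illustrative only: the `Stage` names, the `PackageUpdate` class, and `example-package` are hypothetical, and nothing here calls JIRA or Koji.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage names, one per step on the slide."""
    REQUESTED = 0
    TICKETED = 1
    PACKAGING_UPDATED = 2
    BUILT_IN_DEVELOPMENT = 3
    DEVELOPER_TESTED = 4
    IN_TESTING_REPO = 5       # automated tests passed; hand-off point
    ACCEPTANCE_TESTED = 6
    PRERELEASE_TESTED = 7
    RELEASED = 8


@dataclass
class PackageUpdate:
    name: str
    stage: Stage = Stage.REQUESTED
    ticket_open: bool = False
    history: list = field(default_factory=list)

    def owner(self):
        """Team responsible for the next step from the current stage."""
        if self.stage >= Stage.IN_TESTING_REPO:
            return "release team"
        return "software team"

    def advance(self):
        """Move to the next stage; the ticket opens at TICKETED and closes at RELEASED."""
        if self.stage is Stage.RELEASED:
            raise ValueError(f"{self.name} is already released")
        self.stage = Stage(self.stage + 1)
        if self.stage is Stage.TICKETED:
            self.ticket_open = True
        elif self.stage is Stage.RELEASED:
            self.ticket_open = False
        self.history.append(self.stage.name)
        return self.stage


update = PackageUpdate("example-package")
owners = []
while update.stage is not Stage.RELEASED:
    owners.append(update.owner())
    update.advance()

print(update.history)
print(owners)
```

Keeping the ticket state tied to the stage mirrors the rule that a ticket is closed only when the release is actually performed, not when a fix is merely built.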

Software Ticket Rates

Roughly, we have kept pace with new issues, especially as we have been fully staffed during 2014. JIRA has been an essential system for tracking effort across the distributed team.

Major External Software Components

✦ We expect items marked with (*) to decrease in support costs over the next 2 years due to improvements in upstream packaging.
✦ Globus (GridFTP, GSI, GRAM). (*)
✦ HTCondor. (*)
✦ CVMFS.
✦ HDFS. (*)
✦ Xrootd.
✦ LCMAPS framework and plugins (client-side authorization).
✦ dCache transfer clients. (*)
✦ lcg-utils (transfer clients). (*)
✦ Gratia (accounting).
✦ Network debugging tools (e.g., iperf, nuttcp).
✦ VOMS / VOMS-Admin (authorization framework).
✦ frontier-squid (specialized HTTP proxy configuration).
✦ glexec (user switching).


Major Internal Software Components

✦ RSV (monitoring)
✦ CA bundle
✦ List of VOMS servers
✦ GratiaWeb (accounting web interface)
✦ osg-info-services
✦ Internal build tools
✦ Internal testing tools
✦ osg-configure
✦ osg-pki-tools


Minor components

✦ We expect items marked with (*) to decrease in support costs over the next 2 years due to improvements in upstream packaging.
✦ CCTools
✦ edg-gridftp-client (*)
✦ edg-mkgridmap
✦ frontier-squid
✦ GIP (*)
✦ Meta packages (groups of packages containing only dependencies to other packages; e.g., “osg-ce”)
✦ pakiti (*)
✦ pegasus
✦ uberftp (*)
✦ Xrootd plugins


Major Orphaned Software Components

✦ GUMS (site authorization management).
✦ bestman2 (SRM protocol implementation for POSIX filesystems).
✦ jglobus (Java implementation of grid security libraries).


OASIS

✦ The OASIS service originally provided an all-in-one hosted CVMFS server for smaller VOs to distribute software.
✦ Shared login server, Stratum 0 / repo, and Stratum 1 infrastructure.
  ★ Stratum-1 infrastructure is also used by WLCG experiments.

[Figure: “OASIS Today” diagram. Labeled elements: GOC; Login Host (GSISSH, Install Directory); Stratum-0 and Repo Host (Install Directory, Web directory, Repo Key, Master Key); Stratum-1 (Web Directory) at FNAL; Stratum-1 (Web Directory) at CERN; a second Stratum-0 and Repo Host (Web directory). Labeled arrows: rsync, publish, sign.]

OASIS

✦ Some VOs are outgrowing the shared environment.
✦ Working to deploy external repos - the VO runs the repo host and manages software installs, but OSG still signs and runs the remaining infrastructure.
✦ Still limited to what is possible with CVMFS.

[Figure: “OASIS Year 3” diagram. Labeled elements: GOC; Login Host (GSISSH, Install Directory); Stratum-0 and Repo Host (Install Directory, Web directory, Repo Key, Master Key); Stratum-1 (Web Directory) at FNAL; Stratum-1 (Web Directory) at CERN; a second Stratum-0 and Repo Host (Web directory); an external Repo Host (Install Directory, Web directory, Repo Key). Labeled arrows: publish, sign (twice).]

OASIS - Investigations

✦ HTTP caches basically limit the working set size to a few GB.
✦ Instead of using worker node disk, CVMFS can keep a shared cache on the site’s distributed file system (HDFS, GPFS, Lustre, etc.).
  ★ This takes advantage of the site’s storage while keeping the global consistency model of CVMFS.
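
The working-set limit can be illustrated by replaying repeated scans of a dataset through a size-bounded cache. This is a sketch only: it assumes a least-recently-used eviction policy and made-up sizes, whereas real HTTP caches and the CVMFS cache may use different policies. The function name `hit_rate` is hypothetical.

```python
from collections import OrderedDict


def hit_rate(capacity_gb, object_sizes_gb, passes=3):
    """Replay `passes` sequential scans over the objects through an LRU cache.

    Sizes are whole GB to keep the arithmetic exact. LRU is an assumption
    made for illustration, not a statement about any particular cache.
    """
    cache = OrderedDict()  # object index -> size, least recently used first
    used = 0
    hits = 0
    total = 0
    for _ in range(passes):
        for idx, size in enumerate(object_sizes_gb):
            total += 1
            if idx in cache:
                hits += 1
                cache.move_to_end(idx)
                continue
            # Miss: evict least recently used objects until the new one fits.
            while cache and used + size > capacity_gb:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
            if size <= capacity_gb:
                cache[idx] = size
                used += size
    return hits / total


small_working_set = hit_rate(4, [1] * 3)     # 3 GB of files, 4 GB cache
large_working_set = hit_rate(4, [1] * 100)   # 100 GB of files, 4 GB cache

print(small_working_set, large_working_set)
```

When the working set fits, every pass after the first is served from cache. When it is far larger than the cache, each object is evicted before it is read again, so the cache provides no benefit under this access pattern. That is the motivation for placing the cache on the site's larger shared storage.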
