Brian Bockelman/UNL OSG Agency Review August 19, 2014
Technology and Software
Meeting research needs through advances, delivery and support.
Overview
✦ The Technology and Software team focuses on providing a production-quality set of software to users and resource providers for DHTC and evolving the technology used by the OSG in the medium term.
  ★ We focus on integrating technologies from external projects as opposed to developing our own software.
  ★ We work on advancing and refining the science of DHTC (as we have throughout the lifetime of OSG).
✦ Technology is a balancing act between the torrent of new technologies and maintaining older capabilities.
  ★ We also must balance between meeting stakeholder needs and progressing our internal vision.
  ★ By continuing to simplify the existing OSG Software Stack, we will address new challenges in Years 4 and 5 while staying within the same effort profile.
✦ The evolution of the OSG Software Stack helps us keep abreast of the state of the art in DHTC. All technology has a lifecycle in the OSG:
  ★ We investigate potential new - sometimes immature - technologies that we believe will improve DHTC.
  ★ We integrate and release software into the OSG Software Stack.
  ★ We maintain, deprecate, and remove software at the end of its lifecycle.
OSG Technology Area
✦ The investigations team (1.8 FTE): Focuses on identifying and introducing technologies that could be disruptive on the 6-month to 2-year timescale.
✦ The software team (5.8 FTE): Responsible for maintaining the software stack. Ideally mostly maintenance (bug fixes), nightly testing, and integration.
  ★ This is also the effort which performs development as necessary.
✦ The release team (1.5 paid from project funds; 1.9 total FTE): Responsible for the monthly release process - including acceptance testing, documentation, and release management.
  ★ The responsibility for release management was moved to this separate team in 2013.
[Diagram label alongside the three teams: Decreasing time horizon]
A success in this phase of the OSG has been to integrate these three responsibilities into a single software area.
Beginning of the Software Lifecycle
✦ I’ll now give three examples of new technologies and efforts from the last three years: HTCondor-CE, removing user certificates, and simplifying data management.
✦ Focus is to significantly simplify either the user experience or site operations.
✦ Onboarding new technologies is not a simple process:
  ★ Can it solve our problem? Does it fit into the vision of DHTC? Can we support it? Does it fit with our existing software solution? Will shortcomings be fixed?
✦ Even 8 years into the project, this remains an essential activity - many early technical decisions were made for us by the stakeholders. We must stabilize and lead, changing the decisions to help DHTC succeed for WLCG and the long tail of science.
New Components - HTCondor-CE
✦ The Compute Element (CE) is the set of software which accepts pilot jobs from offsite and submits them to a batch system.
✦ An internal review of our software components indicated our original core technology (GRAM) had a declining support profile, and we were unique in the scale of its use.
  ★ We had the opportunity to consolidate the set of technologies utilized.
✦ We are in the process of deploying a new CE, HTCondor-CE. This is a particular configuration of HTCondor - not a new piece of software.
✦ HTCondor-CE is more scalable, more robust, easier to debug, and thus less costly to operate.
OSG Security - Evolving Trust Relationships
✦ In the old model, users needed a credential (grid certificate) that was globally valid and understood by all resources. In the new model, only the VO needs such a credential.
✦ Bringing our overlay approach into production is what enables the new trust model.
[Figure: Old Model vs. New Model. Old Model: the Resource establishes direct trust with users. New Model: the Resource trusts the VO, and the VO trusts the users.]
Traceability Project
✦ Trust is now recognized as transitive - resources trust the VO; the VO trusts the users; therefore, the resources trust the users.
  ★ A simpler trust model helps the resource provider share resources!
✦ Resources (especially DOE) often have a traceability requirement - they need to demonstrate they know who is using the resource.
  ★ Fulfilling this was easier in the old model where users established a relationship directly with the resource.
✦ OSG’s traceability project helps establish the trust in the new model. Through technical assistance, audits, and VO reviews, we show resources can fulfill this requirement for certain VOs.
  ★ First system studied was GlideinWMS; primary beneficiaries have been the OSG and GLOW VOs.
  ★ This enables sites with strict traceability requirements to get out from underneath a user-unfriendly model.
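
The transitive trust relationship on this slide can be sketched as a small graph search. This is only an illustration, not OSG or GlideinWMS code: the function `trust_chain`, the dictionaries `old_model` and `new_model`, and the user names are all hypothetical. Returning the whole chain, rather than a yes/no answer, is meant to mirror the traceability requirement: the resource can show through whom it trusts a given user.

```python
from collections import deque


def trust_chain(trusts, resource, user):
    """Return the chain of trust from resource to user, or None if there is none.

    `trusts` maps each party to the set of parties it trusts directly.
    """
    queue = deque([[resource]])
    seen = {resource}
    while queue:
        path = queue.popleft()
        # sorted() only makes the traversal order deterministic.
        for trusted in sorted(trusts.get(path[-1], ())):
            if trusted in seen:
                continue
            if trusted == user:
                return path + [trusted]
            seen.add(trusted)
            queue.append(path + [trusted])
    return None


# Old model: the resource establishes trust with every user directly.
old_model = {"resource": {"alice", "bob"}}

# New model: the resource trusts the VO; the VO trusts its users.
new_model = {"resource": {"vo"}, "vo": {"alice", "bob"}}

print(trust_chain(old_model, "resource", "alice"))
print(trust_chain(new_model, "resource", "alice"))
print(trust_chain(new_model, "resource", "mallory"))
```

In the new model the resource holds a single relationship (with the VO) no matter how many users the VO has, which is the simplification the slide points to.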
The Long Tail of Science: Simplifying Software Management
✦ Software installation has long been a serious problem - originally, we asked sites to provide an NFS mount. VOs were responsible for keeping software installs synchronized across dozens of sites.
✦ The WLCG introduced CVMFS, a read-only global file system which distributes files through a network of HTTP caches.
  ★ OSG identified the potential of the concept and developed/deployed CVMFS as a service for our VOs, branded OASIS. This took close collaboration between Technology and Operations.
  ★ In 2013, VOs could install their software once at GOC and publish everywhere. Starting in 2014, VOs can manage their own repositories.
  ★ The network of HTTP caches is built on common, widely-adopted HTTP technologies and is reused for other activities.
✦ Software is data - just differing in scale! With the Intensity Frontier, OSG is working to evaluate alternate file distribution methods to see if CVMFS will scale for data management (targeting dataset sizes of 100GB-1TB).
  ★ We believe there is opportunity in treating this data in a uniform way.
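
The "install once, publish everywhere" idea relies on files being pulled on demand through layers of caches. The sketch below is a toy read-through cache chain, written under that assumption only; it is not how CVMFS itself is implemented, and the class `CacheLayer`, the layer names, and the file path are hypothetical.

```python
class CacheLayer:
    """A toy read-through cache: serve locally if possible, else ask upstream."""

    def __init__(self, name, upstream=None, store=None):
        self.name = name
        self.upstream = upstream          # next layer toward the repository
        self.store = dict(store or {})    # locally held files
        self.upstream_fetches = 0         # how often this layer had to go upstream

    def get(self, path):
        if path in self.store:
            return self.store[path]
        if self.upstream is None:
            raise KeyError(path)          # the repository itself does not have it
        self.upstream_fetches += 1
        data = self.upstream.get(path)
        self.store[path] = data           # keep a copy for later readers
        return data


# The VO publishes a file once at the repository (hypothetical path).
repository = CacheLayer("repository", store={"/vo/bin/tool": b"version-1"})
site_cache = CacheLayer("site HTTP cache", upstream=repository)
workers = [CacheLayer(f"worker-{i}", upstream=site_cache) for i in range(3)]

for worker in workers:
    worker.get("/vo/bin/tool")

print(site_cache.upstream_fetches)
```

Three workers read the file, but the site cache goes back to the repository only for the first request; the other two are served from the cached copy.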
Middle of the Software Lifecycle
✦ The middle of the software lifecycle - mature software in production at sites - is the majority of our effort.
✦ Focus is on keeping production quality:
  ★ Release software as part of a cohesive, consistent set of packages.
  ★ Test and verify new software versions. Does it still work in concert with our other software?
  ★ Attempt to get new features to sysadmins in a timely manner.
  ★ Coordinate with the software developers to solve issues at sites and organize feature requests.
The Software Factory
✦ One important service OSG provides is a “software factory” - the software, services, and processes for a coherent software distribution.
  ★ Logically, raw components (software packages) go in one side. A software distribution comes out the other.
  ★ We assemble/integrate the components, improve them, test them, and distribute the results to the OSG Production Grid.
✦ We are expanding this to a “Software Factory Factory”; other organizations, such as HTCondor, HCC, Internet2, and USCMS, are investigating how to use our infrastructure to produce software distributions.
✦ In the early days, many pieces of grid software were of “research project quality” with packaging not meeting our standards. Sustained efforts in packaging have paid off - most software is now available from widely-known community repositories or has packaging maintained by developers.
  ★ A key enabler was the transition from a niche packaging format to OS-native packaging tools. Software projects had no interest in maintaining “OSG packaging” but have more interest in maintaining quality “RedHat packaging”.
Example: HTCondor
✦ OSG Software team adds 8 patches; for example:
  ★ Patch startup script to integrate OSG-used security libraries.
  ★ Ensure generated grid proxies are at least 1024 bits.
✦ Integrates it with other software packages, such as:
  ★ Globus GRAM to add hooks for the OSG environment in grid jobs.
  ★ GlideinWMS for pilot jobs.
✦ Automated nightly tests include:
  ★ Submit jobs directly to the batch system.
  ★ HTCondor-G job -> GRAM -> HTCondor backend.
✦ Contributed our packaging back to the HTCondor project - they now perform most of the maintenance for us.
Consolidating Technology - HTCondor
✦ During this part of the lifecycle, investments in the technology may bear fruit across several areas.
✦ A key new technology is the ability to submit jobs remotely using SSH instead of a dedicated gatekeeper.
  ★ OSG Technology helped test and validate scale.
  ★ This shares common components (“blahp”) with HTCondor-CE.
  ★ Now forms core of the campus grids solution.
✦ While HTCondor is an external project, OSG is recognized as a flagship community promoting DHTC. As Miron has mentioned, community efforts are even paying off internationally. With OSG help:
  ★ In 2013, the UK WLCG Tier-1 switched its local batch system to HTCondor.
  ★ In 2014, CERN selected HTCondor as their new site batch system.
Software Releases
✦ The release team is responsible for the final product sysadmins see: they are the final safeguards for quality.
  ★ In 2013, we created this separate team led by the release manager (who, organizationally, is part of Operations, not Technology).
✦ Fixes are verified in environments that are as realistic as possible. If possible, real jobs are used.
  ★ Acceptance testing is a very different environment from the software team’s nightly tests.
✦ We target a monthly software release and have a second monthly date set aside for urgent releases (security issues, critical bugs).
  ★ 35 releases since the beginning of the current grant.
  ★ Our average rate is 4 releases per quarter.
✦ Bug reports are only closed once the release team verifies a fix and performs the corresponding release.
  ★ 383 resolved in 2013; 240 to date in 2014.
The Software Orphanage
✦ When a key piece of software dies - runs out of funding, is abandoned by the developers, goes in an incompatible direction - OSG must shoulder the support costs in order to maintain quality or capability.
  ★ The set of software in this state is known as the “software orphanage”.
  ★ The goal of the software orphanage is to ease the pain of its removal from OSG.
  ★ We will maintain and update the software while we help stakeholders replace it with something else.
✦ This presents a continuous struggle: without retiring old software, we would end up spending all our effort here!
Example: Bestman2
✦ The Bestman2 software implements a protocol called SRM (Storage Resource Management), which is essential for interoperability with the WLCG.
  ★ SRM has proven fairly resilient to attempts to retire it; as a niche grid protocol, there are no new implementations.
  ★ Its developers ran out of funding about three years ago.
✦ In 2012 / 2013, we had to make a major investment in this software and its dependencies to support new signing algorithms for SSL.
  ★ While the preference is to retire, this is an example of our ability to maintain software for the stakeholders (who had no capability to do this themselves). This is an essential service.
Outlook
✦ This is a large area - necessarily so, as we must cover technologies through their entire lifecycle.
  ★ Our most visible “output” is an integrated stack of software for accomplishing DHTC.
  ★ Due to our software expertise, there’s significant draw on this area from other areas. Technology intersects networking, security, user support, and campus grids; we often loan effort for specific projects.
  ★ We plan to continue the software evolution through the next two years. We will “draw down” effort on older components and move it into new challenges. We will decrease the number of software components we ship by 25% in the next two years.
✦ The OSG Software Stack is moving to execute our DHTC vision while providing a production platform for existing users.
Questions?
✦ Some additional material is contained in the backup slides; I’ll be happy to answer questions on that material too.
Significant Accomplishments
✦ Completed the transition to native packaging formats (RPM).
  ★ Packaging infrastructure is now more friendly to outsiders.
  ★ Pacman retired at end of Year 1.
✦ Re-organization: Consolidated separate software and technology teams.
✦ Re-organization: Created a separate release team with release manager part of OSG Operations.
✦ Established an (approximately) monthly release cadence: 15 releases in Year 1, 15 releases in Year 2, 4 releases (in three months) in Year 3.
✦ Major software advances:
  ★ “Uneventful” major upgrades of most components.
  ★ Added support for SHA-2 security algorithms.
  ★ Transitioned from Java 6 to 7.
  ★ Upgraded supported version of HDFS to v2.0.
  ★ Release of HTCondor-CE.
  ★ Transition to CVMFS-centric software delivery.
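
As a rough consistency check, the per-year release counts above can be turned into a per-quarter rate and compared with the "4 releases per quarter" quoted on the Software Releases slide. The sketch assumes each project year is four quarters and that "three months" is one quarter; the variable names are illustrative. Note that these counts sum to 34, one fewer than the 35 releases quoted earlier; the slides do not explain the difference.

```python
# (releases, quarters) per period, taken from the bullet above.
# Assumption: a project year is 4 quarters; "three months" is 1 quarter.
periods = {
    "Year 1": (15, 4),
    "Year 2": (15, 4),
    "Year 3 (first three months)": (4, 1),
}

total_releases = sum(releases for releases, _ in periods.values())
total_quarters = sum(quarters for _, quarters in periods.values())
rate_per_quarter = total_releases / total_quarters

print(f"{total_releases} releases over {total_quarters} quarters")
print(f"average: {rate_per_quarter:.2f} releases per quarter")
```

Under these assumptions the average comes out a little under 4 per quarter, which rounds to the figure quoted on the main slide.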
Year 4 - Major Technical Work Items
✦ Add support for RedHat Enterprise Linux 7.
✦ Package PanDA, a major workload management system.
✦ Verify the components of the OSG Software Stack are ready for a 2x increase of scale during Run II.
✦ Improve our ability to allow OSG users to seamlessly utilize XD allocations.
✦ Investigate various mechanisms to improve data delivery to jobs.
✦ Design / implement a new OSG client to better operate in the “overlay world”.
✦ Review technical documentation.
✦ Improve automated testing.
✦ Contribute additional software packaging to community repositories.
Details of Software Release Process
✦ Software team receives a request or notices the need for an update.
✦ Create ticket in JIRA (ticketing system). This ticket is maintained throughout the process.
✦ Download new source code and/or packaging to local storage. Update packaging as needed (at least, version info).
✦ Build new package in Koji (build system); it goes into the development repository.
✦ The software team member tests the build for major flaws, and possibly basic functionality.
✦ Promote package into testing repository. In a clean virtual machine, the automated testing software installs and tests the integrated software.
  ★ At this point, the package is handed to the release team.
✦ One or more testers from the release team test the new package in real installs, fully integrated and under realistic usage.
✦ Do a final round of pre-release testing that integrates all updates.
✦ Release the package into production as part of the monthly release. The ticket in JIRA is closed.
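
The steps above can be read as a linear pipeline in which a ticket stays open until the package is released. The sketch below models that reading only; it does not use the JIRA or Koji APIs, and the `Stage` names, the `Ticket` class, and the ticket key are hypothetical. Holding a package at its current stage when a check fails is also an assumption, since the slide does not say what happens on failure.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages paraphrased from the slide; the names are illustrative."""
    TICKET_OPENED = 1        # request received, ticket created
    PACKAGING_UPDATED = 2    # source downloaded, packaging updated
    DEVELOPMENT_REPO = 3     # built, lands in the development repository
    DEVELOPER_TESTED = 4     # checked for major flaws by a software team member
    TESTING_REPO = 5         # promoted; automated tests run in a clean VM
    ACCEPTANCE_TESTED = 6    # release team tests in real installs
    PRE_RELEASE_TESTED = 7   # final round integrating all updates
    RELEASED = 8             # shipped in the monthly release


def responsible_team(stage: Stage) -> str:
    """Team that does the work needed to reach the given stage."""
    return "software team" if stage <= Stage.TESTING_REPO else "release team"


class Ticket:
    def __init__(self, key: str) -> None:
        self.key = key
        self.stage = Stage.TICKET_OPENED

    @property
    def closed(self) -> bool:
        # The ticket is closed only once the package is released.
        return self.stage == Stage.RELEASED

    def advance(self, checks_passed: bool = True) -> Stage:
        if self.closed:
            raise ValueError(f"{self.key} is already released and closed")
        if checks_passed:  # assumption: a failed check holds the package in place
            self.stage = Stage(self.stage + 1)
        return self.stage


ticket = Ticket("EXAMPLE-1")  # hypothetical ticket key
history = [ticket.stage]
while not ticket.closed:
    history.append(ticket.advance())

for stage in history:
    print(f"{stage.name:20} {responsible_team(stage)}")
```

The hand-off point is visible in the output: everything up to the testing repository belongs to the software team, and the remaining stages belong to the release team.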
Software Ticket Rates
Roughly, we have kept pace with new issues, especially as we have been fully staffed during 2014. JIRA has been an essential system for tracking effort across the distributed team.
Major External Software Components
✦ We expect items marked with (*) to decrease in support costs over the next 2 years due to improvements in upstream packaging.
✦ Globus (GridFTP, GSI, GRAM). (*)
✦ HTCondor. (*)
✦ CVMFS.
✦ HDFS. (*)
✦ Xrootd.
✦ LCMAPS framework and plugins (client-side authorization).
✦ dCache transfer clients. (*)
✦ lcg-utils (transfer clients). (*)
✦ Gratia (accounting).
✦ Network debugging tools (e.g., iperf, nuttcp).
✦ VOMS / VOMS-Admin (authorization framework).
✦ frontier-squid (specialized HTTP proxy configuration).
✦ glexec (user switching).
Major Internal Software Components
✦ RSV (monitoring)
✦ CA bundle
✦ List of VOMS servers
✦ GratiaWeb (accounting web interface)
✦ osg-info-services
✦ Internal build tools
✦ Internal testing tools
✦ osg-configure
✦ osg-pki-tools
Minor components
✦ We expect items marked with (*) to decrease in support costs over the next 2 years due to improvements in upstream packaging.
✦ CCTools
✦ edg-gridftp-client (*)
✦ edg-mkgridmap
✦ frontier-squid
✦ GIP (*)
✦ Meta packages (packages containing only dependencies on other packages; e.g., “osg-ce”)
✦ pakiti (*)
✦ pegasus
✦ uberftp (*)
✦ Xrootd plugins
Major Orphaned Software Components
✦ GUMS (site authorization management).
✦ bestman2 (SRM protocol implementation for POSIX filesystems).
✦ jglobus (Java implementation of grid security libraries).
OASIS
✦ The OASIS service originally provided an all-in-one hosted CVMFS server for smaller VOs to distribute software.
✦ Shared login server, Stratum 0 / repo, and Stratum 1 infrastructure.
  ★ Stratum-1 infrastructure also used by WLCG experiments.
[Figure: “OASIS Today” architecture diagram. Site labels: GOC, FNAL, CERN. Component labels: Login Host (GSISSH), Install Directory, Stratum-0 and Repo Host, Web directory, Repo Key, Master Key, Stratum-1. Arrow labels: rsync, publish, sign.]
OASIS
✦ Some VOs are outgrowing the shared environment.
✦ Working to deploy external repos - VO runs the repo host and manages software installs, but OSG still signs and runs remaining infrastructure.
✦ Still limited to what is possible with CVMFS.
[Figure: “OASIS Year 3” architecture diagram. Site labels: GOC, FNAL, CERN. Component labels: Login Host (GSISSH), Install Directory, Stratum-0 and Repo Host, Web directory, Repo Key, Master Key, Stratum-1, plus an additional Repo Host with its own Install Directory, Web directory, and Repo Key. Arrow labels: publish, sign.]
OASIS - Investigations
✦ HTTP caches basically limit the working set size to a few GB.
✦ Instead of using worker node disk, CVMFS can keep its cache in a shared cache on the site’s distributed file system (HDFS, GPFS, Lustre, etc.).
  ★ This takes advantage of the site’s storage while keeping the global consistency model of CVMFS.