Request Aggregation, Caching, and Forwarding Strategies for Improving Large Climate Data Distribution with NDN: A Case Study

Susmit Shannigrahi, Colorado State University, [email protected]
Chengyu Fan, Colorado State University, [email protected]
Christos Papadopoulos, Colorado State University, [email protected]

ABSTRACT

Scientific domains such as Climate Science, High Energy Particle Physics (HEP) and others routinely generate and manage petabytes of data, projected to rise into exabytes [26]. The sheer volume and long life of the data stress IP networking and traditional content distribution network mechanisms. Thus, each scientific domain typically designs, develops, implements, deploys, and maintains its own data management and distribution system, often duplicating functionality. Supporting various incarnations of similar software is wasteful, prone to bugs, and results in an ecosystem of one-off solutions.

In this paper, we present the first trace-driven study that investigates NDN in the context of a scientific application domain. Our contribution is threefold. First, we analyze a three-year climate data server log and characterize data access patterns to expose important variables such as cache size. Second, using an approximated topology derived from the log, we replay log requests in real time over an NDN simulator to evaluate how NDN improves traffic flows through aggregation and caching. Finally, we implement a simple nearest-replica NDN forwarding strategy and evaluate how NDN can improve scientific content delivery.

CCS CONCEPTS

• Networks → Network architectures; Network design principles; Network simulations; Network performance analysis; Network services; In-network processing;

KEYWORDS

Named Data Networking, NDN, Information Centric Networking, Large Scientific Data, Network Simulations, Network Strategies

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICN ’17, September 26–28, 2017, Berlin, Germany

© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-5122-5/17/09. . . $15.00
https://doi.org/10.1145/3125719.3125722

ACM Reference Format:

Susmit Shannigrahi, Chengyu Fan, and Christos Papadopoulos. 2017. Request Aggregation, Caching, and Forwarding Strategies for Improving Large Climate Data Distribution with NDN: A Case Study. In Proceedings of ICN ’17, Berlin, Germany, September 26–28, 2017, 12 pages. https://doi.org/10.1145/3125719.3125722

1 INTRODUCTION

We are entering a new era of exploration and discovery in many fields, from climate science to high energy particle physics (HEP) and astrophysics to genomics, seismology, and biomedical research, each with its complex workflow requiring massive computing, data handling, and network capacities. The continued cycle of breakthroughs in each of these fields depends crucially on our ability to extract the wealth of knowledge, whether subtle patterns, small perturbations, or rare events, buried in massive datasets whose scale and complexity continue to grow exponentially with time.

In spite of technology advances, the largest data- and network-intensive programs, including the Earth System Grid Federation (ESGF) [11], the Large Hadron Collider (LHC) program [10], the Large Synoptic Survey Telescope (LSST) [12] and the Square Kilometre Array (SKA) astrophysics surveys [16], photon-based sciences, the Joint Genome Institute applications, and many other data-intensive emerging areas of growth, face unprecedented challenges: in global data distribution, processing, access, and analysis; in the coordinated use of massive but still limited computing, storage, and network resources; and in the coordinated operation and collaboration within global scientific enterprises, each encompassing hundreds to thousands of scientists.

The Earth System Grid Federation (ESGF) [11] hosts and distributes approximately 3.5PB of climate data generated by the Coupled Model Intercomparison Project (CMIP) [35] to scientists all over the world. CMIP is a standard experimental framework for studying the output of coupled atmosphere-ocean general circulation models. This project facilitates assessment of the strengths and weaknesses of climate models, which can enhance and focus the development of future models. For example, if the models indicate a broad range of values either regionally or globally, then scientists may be able to determine the cause(s) of this uncertainty. CMIP5 is the most current and extensive of the CMIPs [35]. The large volume of CMIP5 data already presents significant


Intelligent clients require complex configuration: ESGF provides Globus [4], a sophisticated client for high-speed transfers. However, Globus calls for an elaborate setup, is not as easily portable as bash scripts, and requires complex authentication mechanisms. So far, only a few ESGF nodes have integrated Globus into their workflows. Note that even when intelligent solutions are available, they are deployed for a specific community. Due to the complexities associated with developing, maintaining, and configuring such intelligent solutions, scientists are often reluctant to integrate them into their workflows, preferring simpler but less robust solutions.

ESGF does not exploit temporal locality of requests: We noticed a significant amount of temporal locality among the client requests. However, the IP model does not provide request aggregation at the network layer, so currently all requests must travel to the server, consuming considerable network and server resources. We show in later sections that current request patterns are indeed aggregatable and can reduce the load on the server. ESGF also does not provide any caching mechanism, either in the network or at the application layer. However, temporally close requests suggest caching would be useful in reducing server and network load as well as speeding up data delivery to the clients. Clients could configure and maintain their individual caches, but this is yet another complex task.
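To make the aggregation argument concrete, the following minimal sketch counts how many requests in a trace could be absorbed by an earlier, still-pending request for the same file. The (timestamp, filename) trace format and the one-second window are illustrative assumptions; the window stands in for the round-trip time during which a pending request remains outstanding.

def count_aggregatable(requests, window=1.0):
    """Count requests that an identical in-flight request would absorb.

    `requests` is an iterable of (timestamp_seconds, filename) tuples,
    assumed sorted by timestamp; `window` approximates how long a
    pending request keeps duplicates from reaching the server.
    """
    last_forwarded = {}  # filename -> timestamp of last forwarded request
    aggregated = 0
    for ts, name in requests:
        prev = last_forwarded.get(name)
        if prev is not None and ts - prev <= window:
            aggregated += 1  # a pending request for `name` absorbs this one
        else:
            last_forwarded[name] = ts  # this request travels to the server
    return aggregated

# Example: three requests for the same file within one second
trace = [(0.0, "/cmip5/fileA"), (0.4, "/cmip5/fileA"), (0.9, "/cmip5/fileA")]
print(count_aggregatable(trace))  # -> 2

Applied to a temporally localized trace like ours, a counter of this kind gives an upper bound on how many requests never need to reach the server.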

While our analysis focuses on ESGF, we believe other similar data distribution systems can benefit from a common framework at the network layer. We use the ESGF log to demonstrate and quantify improvements from three essential NDN-based capabilities, namely request aggregation, in-network caching, and configurable forwarding strategies. We investigate these in a large-scale NDN simulation based on a real ESGF access log and an approximated network topology reconstructed from the log. We show that NDN can help improve data delivery to end clients and at the same time reduce the load on servers and the network.
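The forwarding strategy itself lives inside the simulator; purely as an illustration of the nearest-replica decision rule (not the paper's ndnSIM code), the sketch below picks, among the faces a FIB entry lists for a prefix, the face with the lowest measured round-trip time, probing unmeasured faces first so every replica gets a chance to be measured.

def choose_face(faces, rtt_estimates):
    """Pick the outgoing face whose replica has answered fastest so far.

    `faces` lists the face ids the FIB holds for this prefix;
    `rtt_estimates` maps face id -> smoothed RTT in ms (None if unprobed).
    """
    unprobed = [f for f in faces if rtt_estimates.get(f) is None]
    if unprobed:
        return unprobed[0]  # measure unknown replicas before optimizing
    return min(faces, key=lambda f: rtt_estimates[f])

rtts = {1: 42.0, 2: 7.5, 3: None}
print(choose_face([1, 2, 3], rtts))  # -> 3 (probe), then 2 once all measured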

Other aspects of NDN, such as naming and packet forwarding speed, are important for data distribution. Fortunately, CMIP5 names are hierarchical, so we use them with only minor changes (see [25]). We do not address NDN packet forwarding performance in this paper since there is a large, ongoing effort from the community to improve it [32], [33].
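To illustrate why CMIP5 names translate naturally, the sketch below maps a DRS-style CMIP5 filename onto a hierarchical, NDN-like name. The filename pattern is the common CMIP5 convention, but the component order in the output is our own illustrative choice, not necessarily the scheme of [25].

def cmip5_to_ndn_name(filename):
    """Map a CMIP5 DRS-style filename to a hierarchical NDN-like name.

    Assumes the common CMIP5 pattern
    <variable>_<table>_<model>_<experiment>_<ensemble>[_<time_range>].nc
    """
    parts = filename.removesuffix(".nc").split("_")
    variable, table, model, experiment, ensemble = parts[:5]
    components = ["cmip5", model, experiment, table, variable, ensemble]
    if len(parts) > 5:
        components.append(parts[5])  # optional time range, e.g. 200512-203011
    return "/" + "/".join(components)

print(cmip5_to_ndn_name("tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc"))
# -> /cmip5/HadGEM2-ES/rcp85/Amon/tas/r1i1p1/200512-203011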

3 RELATED WORK

Previous studies of NDN and CCN data distribution have typically focused on a single aspect: caching, forwarding strategy, or Interest aggregation. Our study investigates the benefits of these elements together and is the first to use a real trace to evaluate NDN's benefits to scientific workflows in all three dimensions.

Studies on caching such as [24], [13], [21], [37], [19] have focused exclusively on cache placement, cache replacement policies, and improvements to network traffic through caching. A handful of studies have investigated Interest aggregation [14], [15] and forwarding strategies [36].

The paper on Interest aggregation [14] argued that aggregation does not work for real-world traffic. However, our study shows that Interest aggregation can be useful in some high-traffic scenarios. We also show that Interest aggregation provides better value when combined with caching and complementary intelligent network strategies.
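A toy model makes the combined effect easy to see. The sketch below places an LRU content store in front of a pending-Interest set, so each request is either served from cache, absorbed by a pending Interest, or forwarded to the server. The structure and the default size are illustrative assumptions, not the simulator's implementation.

from collections import OrderedDict

class EdgeRouter:
    """Toy NDN edge router: an LRU content store plus a pending set."""

    def __init__(self, cache_entries=1000):
        self.cache = OrderedDict()   # name -> True (content store)
        self.cache_entries = cache_entries
        self.pending = set()         # names with an Interest in flight
        self.server_fetches = 0

    def request(self, name):
        if name in self.cache:       # cache hit: served locally
            self.cache.move_to_end(name)
            return "cache"
        if name in self.pending:     # aggregated onto a pending Interest
            return "aggregated"
        self.pending.add(name)       # miss: forward toward the server
        self.server_fetches += 1
        return "forwarded"

    def data_arrived(self, name):    # Data satisfies the pending Interest
        self.pending.discard(name)
        self.cache[name] = True
        if len(self.cache) > self.cache_entries:
            self.cache.popitem(last=False)  # evict least recently used

Replaying a trace through request() and data_arrived() and comparing server_fetches against the total request count gives the kind of server-load reduction the simulations quantify.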

There is very little prior work using actual network traces. Most studies use synthetic data produced from statistical distributions such as Zipf. While such studies provide insights into NDN's improvements, they are hard to generalize and are usually applicable to only one particular type of workflow. Studies that do use real user traffic [24], [21] have focused exclusively on web traffic.

Our work investigates NDN from the perspective of a scientific data distribution system. Scientific traffic is different from regular web traffic: the cumulative traffic volume is much larger, and the request patterns are highly localized. Unlike previous work that used traces spanning a few weeks, our trace spans several years, which gives us a better long-term picture of the traffic characteristics.

4 SCIENTIFIC DATA ACCESS PATTERNS

Our server log was exported by the ESGF node at Lawrence Livermore National Laboratory (LLNL), which is part of a federation of nodes interconnected through ESnet's high-speed network [3]. The node serves climate data to scientists located across the globe. Each entry in the log represents a file download request and contains information such as the requester's IP address, the user's OpenID, the request timestamp as the number of seconds since the epoch, the name of the requested file, a success/failure code, and the file transfer size.
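For readers who want to reproduce this kind of analysis, a minimal parser might look as follows. The actual ESGF log layout is not reproduced in this paper, so the whitespace-separated, six-field format here is an assumption; only the field set matches the description above.

from dataclasses import dataclass

@dataclass
class LogEntry:
    ip: str
    openid: str
    timestamp: int    # seconds since the epoch
    filename: str
    status: int       # success/failure code; -1 denotes failure
    bytes_sent: int   # transfer size in bytes

def parse_line(line):
    """Parse one download record (hypothetical whitespace-separated layout)."""
    ip, openid, ts, name, status, size = line.split()
    return LogEntry(ip, openid, int(ts), name, int(status), int(size))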

4.1 Request Counts and Locations

The log spans three years, from 2013 to 2016, and contains about 18.5 million entries. Each entry represents an HTTP GET request for a single file. From the 18.5 million requests we extracted a set of unique IP addresses, which we refer to as "clients", and geolocated them using the MaxMind City Database [23]. Figure 1 shows the locations of these clients.
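A geolocation pass over the unique client IPs can be sketched with MaxMind's geoip2 reader. The database file name is an assumption, and the exact library the study used is not stated (the paper cites the MaxMind City Database [23]).

import geoip2.database  # pip install geoip2; database file from MaxMind
import geoip2.errors

def geolocate_clients(client_ips, db_path="GeoLite2-City.mmdb"):
    """Map each unique client IP to (latitude, longitude).

    IPs absent from the database are skipped.
    """
    coords = {}
    with geoip2.database.Reader(db_path) as reader:
        for ip in client_ips:
            try:
                loc = reader.city(ip).location
                coords[ip] = (loc.latitude, loc.longitude)
            except geoip2.errors.AddressNotFoundError:
                continue
    return coords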

We classify requests that failed to transfer any data (or transferred zero bytes) as "failures". We classify the remaining requests into two categories: partial transfers, where the transfer size is less than the requested file size, and completed transfers, where the transfer size is equal to the requested file size.
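Using the LogEntry from the earlier parsing sketch, this classification reduces to a few comparisons. The file_sizes catalog lookup is a hypothetical helper, since the log itself records only the transferred size, not the full file size.

def classify(entry, file_sizes):
    """Bucket a request per the definitions above.

    `file_sizes` maps a filename to its full size in bytes
    (hypothetical catalog lookup).
    """
    if entry.bytes_sent == 0:
        return "failure"
    if entry.bytes_sent < file_sizes[entry.filename]:
        return "partial"
    return "completed"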

Out of the 18.5 million requests, only 5.7 million are partial or completed; the remaining failed without transferring any data. The log only provides a generic error code (-1) upon failure, so we do not know the precise reason for such a significant number of failures. Anecdotal evidence points towards authentication failures, server overload, or user error. We initially theorized that failures correlate with geography and connectivity. To test this theory, we plotted a failure heat map, shown in Figure 2. The figure indicates that failures and partial downloads were noticeable in areas considered to


REFERENCES

https://github.com/susmit85/icn17-simulation-scenario.

[9] TCP Tuning at ESnet. https://fasterdata.es.net/assets/fasterdata/JT-201010.pdf.

[10] Akesson, T. The ATLAS experiment at the CERN Large Hadron Collider.

[11] Cinquini, L., Crichton, D., Mattmann, C., Harney, J., Shipman, G., Wang, F., Ananthakrishnan, R., Miller, N., Denvil, S., Morgan, M., et al. The Earth System Grid Federation: An open infrastructure for access to distributed geospatial data. Future Generation Computer Systems 36 (2014), 400–417.

[12] LSST Dark Energy Science Collaboration, et al. Large Synoptic Survey Telescope: Dark Energy Science Collaboration. arXiv preprint arXiv:1211.0310 (2012).

[13] Dabirmoghaddam, A., Barijough, M. M., and Garcia-Luna-Aceves, J. Understanding optimal caching and opportunistic caching at the edge of information-centric networks. In Proceedings of the 1st International Conference on Information-Centric Networking (2014), ACM, pp. 47–56.

[14] Dabirmoghaddam, A., Dehghan, M., and Garcia-Luna-Aceves, J. Characterizing Interest aggregation in content-centric networks. In IFIP Networking Conference (IFIP Networking) and Workshops, 2016 (2016), IEEE, pp. 449–457.

[15] Dehghan, M., Jiang, B., Dabirmoghaddam, A., and Towsley, D. On the analysis of caches with pending interest tables. In Proceedings of the 2nd International Conference on Information-Centric Networking (2015), ACM, pp. 69–78.

[16] Dewdney, P. E., Hall, P. J., Schilizzi, R. T., and Lazio, T. J. L. The Square Kilometre Array. Proceedings of the IEEE 97, 8 (2009), 1482–1496.

[17] Eyring, V., Bony, S., Meehl, G. A., Senior, C. A., Stevens, B., Stouffer, R. J., and Taylor, K. E. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development 9, 5 (2016), 1937–1958.

[18] Fan, C., Shannigrahi, S., DiBenedetto, S., Olschanowsky, C., Papadopoulos, C., and Newman, H. Managing scientific data with named data networking. In Proceedings of the Fifth International Workshop on Network-Aware Data Management (2015), ACM, p. 1.

[19] Fayazbakhsh, S. K., Lin, Y., Tootoonchian, A., Ghodsi, A., Koponen, T., Maggs, B., Ng, K., Sekar, V., and Shenker, S. Less pain, most of the gain: Incrementally deployable ICN. In ACM SIGCOMM Computer Communication Review (2013), vol. 43, ACM, pp. 147–158.

[20] Guok, C., Robertson, D., Thompson, M., Lee, J., Tierney, B., and Johnston, W. Intra and interdomain circuit provisioning using the OSCARS reservation system. In Broadband Communications, Networks and Systems, 2006 (BROADNETS 2006), 3rd International Conference on (2006), IEEE, pp. 1–8.

[21] Imbrenda, C., Muscariello, L., and Rossi, D. Analyzing cacheable traffic in ISP access networks for micro CDN applications via content-centric networking. In Proceedings of the 1st International Conference on Information-Centric Networking (2014), ACM, pp. 57–66.

[22] Mastorakis, S., Afanasyev, A., Moiseenko, I., and Zhang, L. ndnSIM 2.0: A new version of the NDN simulator for ns-3. NDN Technical Report NDN-0028 (2015).

[23] MaxMind, L. GeoIP, 2006.

[24] Olmos, F., Kauffmann, B., Simonian, A., and Carlinet, Y. Catalog dynamics: Impact of content publishing and perishing on the performance of an LRU cache. In Teletraffic Congress (ITC), 2014 26th International (2014), IEEE, pp. 1–9.

[25] Olschanowsky, C., Shannigrahi, S., and Papadopoulos, C. Supporting climate research using named data networking. In Local & Metropolitan Area Networks (LANMAN), 2014 IEEE 20th International Workshop on (2014), IEEE, pp. 1–6.

[26] Overpeck, J. T., Meehl, G. A., Bony, S., and Easterling, D. R. Climate data challenges in the 21st century. Science 331, 6018 (2011), 700–702.

[27] Ozmutlu, H. C., Spink, A., and Ozmutlu, S. Analysis of large data logs: An application of Poisson sampling on Excite web queries. Information Processing & Management 38, 4 (2002), 473–490.

[28] O’Hara, R. B., and Kotze, D. J. Do not log-transform count data. Methods in Ecology and Evolution 1, 2 (2010), 118–122.

[29] Ren, Y., Li, J., Shi, S., Li, L., Wang, G., and Zhang, B. Congestion control in named data networking – a survey. Computer Communications 86 (2016), 1–11.

[30] Schneider, K., Yi, C., Zhang, B., and Zhang, L. A practical congestion control scheme for named data networking. In Proceedings of the 3rd ACM Conference on Information-Centric Networking (2016), ACM, pp. 21–30.

[31] Shannigrahi, S., Papadopoulos, C., Yeh, E., Newman, H., Barczyk, A. J., Liu, R., Sim, A., Mughal, A., Monga, I., Vlimant, J.-R., et al. Named data networking in climate research and HEP applications. In Journal of Physics: Conference Series (2015), vol. 664, IOP Publishing, p. 052033.

[32] So, W., Narayanan, A., Oran, D., and Stapp, M. Named data networking on a router: Forwarding at 20Gbps and beyond. In ACM SIGCOMM Computer Communication Review (2013), vol. 43, ACM, pp. 495–496.

[33] Song, T., Yuan, H., Crowley, P., and Zhang, B. Scalable name-based packet forwarding: From millions to billions. In Proceedings of the 2nd International Conference on Information-Centric Networking (2015), ACM, pp. 19–28.


[34] Strand, G. Community Earth System Model data management: Policies and challenges. Procedia Computer Science 4 (2011), 558–566.

[35] Taylor, K. E., Stouffer, R. J., and Meehl, G. A. An overview of CMIP5 and the experiment design. Bulletin of the American Meteorological Society 93, 4 (2012), 485–498.

[36] Tortelli, M., Grieco, L. A., and Boggia, G. Performance assessment of routing strategies in named data networking. In IEEE ICNP (2013).

[37] Wang, Y., Li, Z., Tyson, G., Uhlig, S., and Xie, G. Optimal cache allocation for content-centric networking. In Network Protocols (ICNP), 2013 21st IEEE International Conference on (2013), IEEE, pp. 1–10.

[38] Zhang, L., Afanasyev, A., Burke, J., Jacobson, V., Crowley, P., Papadopoulos, C., Wang, L., Zhang, B., et al. Named data networking. ACM SIGCOMM Computer Communication Review 44, 3 (2014), 66–73.
