
Cloud Computing


Page 1: Cloud Computing

+ Mobile Cloud Storage Users 20

+ Containers in Cloud Architecture 81

SEPTEMBER 2014 • www.computer.org/cloudcomputing


Page 2: Cloud Computing

Submission deadline: 1 Mar 2015 • Publication date: July-Aug 2015

With the increasing popularity of cloud services and their potential to be either the target of or the tool used in cybercrime, organizational cloud service users need to ensure that their (cloud) data is secure and, in the event of a compromise, they must have the capability to collect evidential data.

Surveillance of citizens by their governments is not new. The relatively recent revelations by Edward Snowden, a former NSA contractor, of extensive NSA surveillance (including of cloud service providers and users) nevertheless reminded us of the need to balance a secure cloud computing system with individuals' right to privacy. This is further complicated by the need to protect the community from serious and organized crime, terrorism, cybercrime, and other threats to national security interests, which has serious implications for the ability of governments to protect their citizens against cybersecurity threats. The area remains under-researched due to the interdisciplinary challenges specific to this field.

This special issue will focus on cutting edge research from both academia and industry on the topic of balancing cloud user privacy with legitimate surveillance and lawful data access, with a particular focus on cross-disciplinary research. For example, how can we design technologies that will enhance “guardianship” and the “deterrent” effect in cloud security at the same time as reducing the “motivations” of cybercriminals?

Topics of interest include but are not limited to:

• Advanced cloud security

• Cloud forensics and anti-forensics

• Cloud incident response

• Cloud information leakage detection and prevention

• Enhancing and/or preserving cloud privacy

• Cloud surveillance

• Crime prevention strategies

• Legal issues relating to surveillance

• Enhancing privacy technology for cloud-based apps

High-quality survey papers on the above topics are also welcome.

Special Issue Guest Editors: Kim-Kwang Raymond Choo, University of South Australia

Rick Sarre, University of South Australia

Submission Information: Submissions will be subject to IEEE Cloud Computing magazine's peer-review process. Articles should be at most 6,000 words, with a maximum of 15 references, and should be understandable to a broad audience of people interested in cloud computing, big data, and related application areas. The writing style should be down to earth, practical, and original.

All accepted articles will be edited according to the IEEE Computer Society style guide. Submit your papers through Manuscript Central at https://mc.manuscriptcentral.com/ccm-cs. Contact the guest editors at [email protected].

Call for Papers

Legal Clouds: How to Balance Privacy with Legitimate Surveillance and Lawful Data Access

www.computer.org/cloudcomputing


Page 3: Cloud Computing

EDITOR IN CHIEF: Mazin Yousif, T-Systems International, [email protected]

EDITORIAL BOARD: Zahir Tari, RMIT University

Rajiv Ranjan, CSIRO Computational Informatics

Eli Collins, Cloudera

Kim-Kwang Raymond Choo, University of South Australia

Ivona Brandic, Vienna University of Technology

David Bernstein, Cloud Strategy Partners

Alan Sill, Texas Tech University

Omer Rana, Cardiff University

Beniamino Di Martino, Second University of Naples

Samee Khan, North Dakota State University

J.P. Martin-Flatin, EPFL

Pascal Bouvry, University of Luxembourg

Laura Taylor, Relevant Technologies

STEERING COMMITTEE: Manish Parashar, Rutgers, the State University of New Jersey

Steve Gorshe, PMC-Sierra (Communications Society liaison; EIC Emeritus, IEEE Communications)

Carl Landwehr, NSF, IARPA (EIC Emeritus IEEE S&P)

Dennis Gannon, Microsoft

V.O.K. Li, University of Hong Kong (Communications Society liaison)

Rolf Oppliger, eSecurity Technologies

Hui Lei, IBM

Kirsten Ferguson-Boucher, Aberystwyth University

EDITORIAL STAFF: Brian Kirk • Lead Editor • [email protected]

Joan Taylor • Content Editor

Lee Garber, Keri Schreiner, Jenny Stout • Contributing Editors

Carmen Garvey, Jennie Zhu-Mai • Production & Design

Robin Baldwin • Senior Manager, Editorial Services

Evan Butterfield • Products and Services Director

Sandy Brown • Senior Business Development Manager

Marian Anderson • Senior Advertising Coordinator

CS MAGAZINE OPERATIONS COMMITTEE: Paolo Montuschi (chair), Erik R. Altman, Maria Ebling, Miguel Encarnação, Lars Heide, Cecilia Metra, San Murugesan, Shari Lawrence Pfleeger, Michael Rabinovich, Yong Rui, Forrest Shull, George K. Thiruvathukal, Ron Vetter, Daniel Zeng

CS PUBLICATIONS BOARD: Jean-Luc Gaudiot (VP for Publications), Alain April, Laxmi N. Bhuyan, Angela R. Burgess, Greg Byrd, Robert Dupuis, David S. Ebert, Frank Ferrante, Paolo Montuschi, Linda I. Shafer, H.J. Siegel, Per Stenström

IEEE Cloud Computing (ISSN 2325-6095) is published quarterly by the IEEE Computer Society. IEEE headquarters: Three Park Ave., 17th Floor, New York, NY 10016-5997. IEEE Computer Society Publications Office: 10662 Los Vaqueros Cir., Los Alamitos, CA 90720; +1 714 821 8380; fax +1 714 821 4010. IEEE Computer Society headquarters: 2001 L St., Ste. 700, Washington, DC 20036.

Subscription rates: IEEE Computer Society members get the lowest rate of US$39 per year. Go to www.computer.org/subscribe to order and for more information on other subscription prices.


Page 4: Cloud Computing

FEATURED ARTICLES

24 Guest Editors' Introduction: Securing Big Data Applications in the Cloud
Bharat Bhargava, Ibrahim Khalil, and Ravi Sandhu

27 Enhancing Big Data Security with Collaborative Intrusion Detection
Zhiyuan Tan, Upasana T. Nagar, Xiangjian He, Priyadarsi Nanda, Ren Ping Liu, Song Wang, and Jiankun Hu

34 Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters
Abdulrahman A. Almutairi and Arif Ghafoor

46 Efficient and Secure Transfer, Synchronization, and Sharing of Big Data
Kyle Chard, Steven Tuecke, and Ian Foster

56 Location-Based Security Framework for Cloud Perimeters
Chetan Jaiswal, Mahesh Nath, and Vijay Kumar

65 Multilabels-Based Scalable Access Control for Big Data Applications
Hongsong Chen, Bharat Bhargava, and Fu Zhongchuan

What will the future of cloud computing look like? What are some of the issues professionals, practitioners, and researchers need to address when utilizing cloud services? IEEE Cloud Computing magazine serves as a forum for the constantly shifting cloud landscape, bringing you original research, best practices, in-depth analysis, and timely columns from luminaries in the field.


Page 5: Cloud Computing

Reuse Rights and Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of their IEEE-copyrighted material on their own Web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate review suggestions, but not the published version with copyediting, proofreading and formatting added by IEEE. For more information, please go to: http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html.

Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or [email protected]. Copyright © 2014 IEEE. All rights reserved.

Abstracting and Library Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

September 2014 • Volume 1, Issue 3

www.computer.org/cloudcomputing

COLUMNS

4 News: In Brief
Lee Garber

8 From the Editor in Chief: A Focus on Security and Privacy in the Cloud
Mazin Yousif

10 Cloud and the Government: FedRAMP: History and Future Direction
Laura Taylor

15 Standards Now: Cloud Standards and the Spectrum of Development
Alan Sill

20 Cloud and the Law: Mobile Cloud Storage Users
Kim-Kwang Raymond Choo

72 What's Trending? Bringing Big Data Systems to the Cloud
Amandeep Khurana

76 BlueSkies: Application Security through Federated Clouds
Paul Watson

81 Cloud Tidbits: Containers and Cloud: From LXC to Docker to Kubernetes
David Bernstein

23 Advertising Index

45 IEEE CS Information


Page 6: Cloud Computing

In Brief

Lee Garber

IEEE Computer Society, [email protected]

AS MORE ORGANIZATIONS ADOPT THE CLOUD, NEW ISSUES WILL CONTINUE TO EMERGE. In each issue, IEEE Cloud Computing's news briefs look at recent happenings and trends in the cloud world.

Support Grows for New Software Approach that Could Boost Cloud Computing

Major technology companies such as Amazon and Google are supporting Docker (www.docker.com), a new open source platform that could make it easier to run applications on multiple machines.

Developers use Docker to place applications in software containers, which users can download across the Internet or on any private network and use on any Linux machine or cloud platform.

This would be a huge benefit for cloud computing, which is often used to make applications that are kept online available to all types of computing devices. In fact, proponents note, this is one of cloud computing's purposes.

They add that Docker will make developers' lives much easier by letting them focus on designing programs without worrying about the machine or platform on which they'll run.

Containers aren’t new, but Docker claims its technology makes packaging applications and moving them among various types of machines easier.

The system consists of the Docker engine, a lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and handling workflows.

According to Docker, about 14,000 applications are now using its containers. eBay is using the system to test new software in its datacenters. And Google, which is trying to challenge Amazon's dominance in the cloud computing market, is also working with Docker.

The technology isn’t without its concerns. For example, machines must download software enabling them to use the containers. The software is sup-posed to run the same way on all Linux versions, but this isn’t always the case. Some containers therefore might not run on all operating systems. Docker and its supporters say they are working on this.

In addition, some cloud service providers are working on their own proprietary application-portability technologies and thus might not adopt or might even oppose Docker.

Service Offers New Approach to Cloud Security

A vendor has released a new open source program designed to let users securely store data in the cloud for future access without also having to place their private cryptographic keys there.

CloudFlare's Keyless SSL lets users store the private keys on an internal, rather than a public-facing, server. The ability to better protect keys could overcome concerns that businesses that handle sensitive data—such as financial and healthcare companies—have about keeping data in the cloud.

Typically, firms using the cloud store private keys on the same public-facing server that handles Web traffic. However, this could let hackers access the key and compromise the security of data stored online.

In some cases, businesses use third parties to handle their SSL systems, including their keys. However, this places those keys out of the businesses’ control.

With CloudFlare's new system, private SSL keys are maintained on customers' internal servers, which can sit behind firewalls or be secured in other ways. Users install an agent on their servers to handle data-access requests. To protect the communications involved in the process, the system transmits and processes key-signing requests via an encrypted tunnel to the user's server.
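The delegation pattern described above can be sketched in a few lines. This is an illustrative sketch only, not CloudFlare's actual protocol: the class names (`InternalKeyServer`, `EdgeServer`) are invented, a direct method call stands in for the encrypted tunnel, and an HMAC stands in for the RSA/ECDSA signing a real TLS handshake would use. The point is the architecture: the public-facing server never holds the key.

```python
import hashlib
import hmac


class InternalKeyServer:
    """Holds the private key behind the firewall; only ever signs digests."""

    def __init__(self, private_key: bytes):
        self._key = private_key  # never leaves this process

    def sign(self, digest: bytes) -> bytes:
        # Stand-in for an RSA/ECDSA signing operation.
        return hmac.new(self._key, digest, hashlib.sha256).digest()


class EdgeServer:
    """Public-facing server: terminates connections but holds no key."""

    def __init__(self, key_server: InternalKeyServer):
        # In a real deployment this reference is an encrypted tunnel
        # to an agent running on the customer's internal server.
        self.key_server = key_server

    def handshake(self, client_random: bytes) -> bytes:
        digest = hashlib.sha256(client_random).digest()
        # Delegate the single key-dependent step to the internal server.
        return self.key_server.sign(digest)


ks = InternalKeyServer(b"secret-private-key")
edge = EdgeServer(ks)
sig = edge.handshake(b"hello")
# The edge server produced a valid signature without ever seeing the key.
assert sig == hmac.new(b"secret-private-key",
                       hashlib.sha256(b"hello").digest(),
                       hashlib.sha256).digest()
```

A compromise of the edge server thus exposes traffic-handling code but not the long-lived private key, which is the property the article attributes to the design.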

CloudFlare says it got the idea for the new product after being approached by financial institutions that had suffered cyberattacks.

The company plans to bundle Keyless SSL with its enterprise security service.

4 IEEE CLOUD COMPUTING • PUBLISHED BY THE IEEE COMPUTER SOCIETY • 2325-6095/14/$31.00 © 2014 IEEE

CLOUD NEWS

qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

_________________

Page 7: Cloud Computing


Will Cloud Computing Close the Tech Industry's Gender Gap?

Intel and other companies are expressing hope that the rise of cloud computing could attract more women to technology-related jobs.

The US Department of Labor predicts that the increasing use of cloud computing technologies and services will create 1.4 million jobs domestically by 2020. However, US universities will provide enough graduates to fill only an estimated 29 percent of them. Intel says the need to make up the difference could provide a way to get more women interested in technology careers.

Currently, women hold only an estimated one-fourth of US computing and technical jobs. However, cloud computing is a relatively new technology requiring different types of skills. Intel says this could attract women who might not have been interested in traditional computer technologies and could force companies to change their traditional hiring approaches.

To encourage this process, Intel recently paid half of the registration fee—which ranges from $1,395 to $1,595—for women attending the first IT Cloud Computing Conference in late October in San Francisco. The company also paid the entire fee for 50 female college students majoring in science, technology, engineering, or mathematics (STEM).

This effort exposes women to technology and gives them an opportunity to network and to meet professionals in the field, according to Intel, whose president, Renée James, is a woman.

Support for these types of efforts has come from the nonprofit Girls Who Code organization (http://girlswhocode.com), whose members include Adobe Systems, Amazon, AT&T, eBay, Facebook, Google, Intel, Microsoft, and Twitter.

Study: European Companies Aren't Taking Advantage of Cloud Technology

Large corporations are having trouble finding enough IT workers with the expertise necessary to meet their cloud computing goals.

Many companies, therefore, haven't been able to fully adopt cloud technologies. And their IT departments aren't confident of their readiness to do so, according to a recent study by market research firm International Data Corp. (IDC).

IDC initially surveyed European firms and found that 56 percent of responding IT departments can't find qualified workers to support their cloud-related efforts. About 60 percent are having trouble improving the skills of current employees so that they can help with tasks such as evaluating cloud service providers.

Only about 30 percent of European IT departments told IDC that they can determine the costs and benefits of their cloud projects well enough to justify them to management. And just 40 percent of companies say they use cloud technology extensively and effectively enough to gain a marketplace advantage.

All this is occurring as European enterprise spending on cloud services and technology has grown 25 percent during the past year. To determine if these problems are limited to Europe, IDC interviewed high-ranking staff at about 1,100 companies worldwide and found similar issues.

IBM Uses the Cloud to Take Analytics to the Masses

Typically, only large companies with the money to buy powerful computers and expensive software and hire specially trained personnel have been able to perform complex analytics on the huge amounts of data they collect. This has limited the adoption of sophisticated data analytics products.

Now, however, IBM is using its Watson supercomputer technology and the cloud to deliver such services to smaller organizations. Scientists and developers from IBM’s data analysis and Watson units worked on the Watson Analytics project for about a year before announcing it recently.

The system combines IBM's data analytics approaches with Watson's computing power and machine-learning capabilities, as well as its ability to work with natural-language input. The latter enables company employees who aren't data scientists to query databases to recognize useful patterns or derive helpful predictions from large amounts of corporate information.

The system can display results in formats such as text, charts, or graphs. It can also incorporate data about external factors to help with the process.


qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

qqM

Mq

qM

MqM

THE WORLD’S NEWSSTAND®

Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page

___

Page 8: Cloud Computing


Industry observers cite a need for services that can do what Watson Analytics promises to do. However, they add, the offering's success will ultimately depend on factors such as reliability, ease of use, and the value of its results.

Security Experts: Hackers Stole Nude Photos of Celebrities from Apple's iCloud

The unprecedented series of high-profile cybercrimes that began late last year may have moved into the cloud.

A possible attack on Apple's iCloud cloud storage and cloud computing service has joined an ongoing series of hacks on major retailers such as the Target department stores; JPMorgan Chase, the US's biggest bank; the huge Home Depot home-improvement store chain; and grocery store groups across the United States.

In many of these cases, the attackers stole customers' sensitive personal data, including Social Security numbers and payment card information.

Recently, security researchers say, hackers broke into Apple's iCloud service and stole nude photos, explicit videos, and other personal material that 101 celebrities had loaded onto their iPhones and then stored in iCloud. The material was subsequently posted for sale on black market websites.

Security experts contend that the cybercriminals breached the iCloud accounts by exploiting a flaw in Apple's Find My iPhone API. They say the API didn't lock out people making more than a set number of failed attempts to log into accounts, as many applications do for security purposes. This let the hackers keep trying possible passwords—based on knowledge of the celebrities—until they hit the right ones.

They then connected to iCloud and retrieved various people's iPhone backups.
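The missing safeguard the researchers describe—locking an account after a bounded number of consecutive failed attempts—can be sketched as follows. The `LoginGuard` class and its threshold are invented for illustration and are not Apple's implementation; a production system would also add time-based unlock, per-IP rate limiting, and audit logging.

```python
class LoginGuard:
    """Locks an account after too many consecutive failed login attempts."""

    MAX_FAILURES = 5

    def __init__(self, passwords: dict):
        self._passwords = passwords  # account -> password
        self._failures = {}          # account -> consecutive failure count

    def attempt(self, account: str, password: str) -> str:
        if self._failures.get(account, 0) >= self.MAX_FAILURES:
            return "locked"          # brute-force guessing stops here
        if self._passwords.get(account) == password:
            self._failures[account] = 0
            return "ok"
        self._failures[account] = self._failures.get(account, 0) + 1
        return "denied"


guard = LoginGuard({"alice": "correct-horse"})
# An attacker cycling through guesses is cut off after five failures,
# even if a later guess would have been right.
results = [guard.attempt("alice", f"guess-{i}") for i in range(6)]
assert results == ["denied"] * 5 + ["locked"]
assert guard.attempt("alice", "correct-horse") == "locked"
```

Without the threshold check, the loop above could simply continue until a guess succeeded, which is the behavior attributed to the flawed API.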

Apple acknowledges security issues with Find My iPhone and says it's fixing them. But the company claims it isn't responsible for the theft of celebrities' personal material.

Instead, it contends, hackers either guessed celebrities' passwords based on public information about them, or used phishing to send fake but legitimate-appearing emails that convinced celebrities to provide login information.

iOS 8 Bug Deletes iCloud Documents

Users of iPhones and iPads running iOS 8 say an operating system flaw deletes iWork documents from the iCloud Drive when they reset their devices.

On the MacRumors user-support discussion website, users reported that performing the "reset all settings" operation removed word processing, spreadsheet, and presentation documents from the new iCloud Drive, which iOS 8 can use for storage and synchronization.

They complained that the dialog box at the start of the reset specifically said that the process would restore factory settings—as a last resort to solve system problems—but not delete data. Some users stated that even the Apple Time Machine restoration application couldn't recover the missing files, although one said it could.

Several people complained that Apple technical support representatives told them that, for example, the problem was temporary or that the data was still on the device.

Now, however, some users say, it appears they will never recover the documents.

Apple introduced the iCloud Drive this year, saying it was an alternative to third-party cloud storage and synchronization services.

Call for Articles

IEEE Pervasive Computing seeks accessible, useful papers on the latest peer-reviewed developments in pervasive, mobile, and ubiquitous computing. Topics include hardware technology, software infrastructure, real-world sensing and interaction, human-computer interaction, and systems considerations, including deployment, scalability, security, and privacy.

Author guidelines: www.computer.org/mc/pervasive/author.htm
Further details: [email protected] or www.computer.org/pervasive


Page 9: Cloud Computing

Submission deadline: 15 Jan 2015 • Publication date: Mar-Apr 2015

Cloud computing continues to increase in complexity due to both the increasing availability of configuration options from public cloud providers and the increasing variability and types of application instances that can be deployed on such platforms—for example, tuning options in hypervisors that enable different virtual machine instances to be associated with physical machines; storage, compute, and I/O preferences that offer different power and price; and operating system configurations that provide differing degrees of performance or security. This complexity can also be seen in the enterprise-scale datacenters that dominate computing infrastructures in industry, which are growing in size and complexity, leading to complex business applications and workflows.

Autonomic computing enables self-management of systems and applications. The underlying concepts and mechanisms of autonomics can be applied to each component within a cloud system (resource manager/scheduler, power manager, etc.) as well as to the cloud system as a whole, or they can be applied within an application that makes use of such a cloud system. Autonomics can also play a critical role as applications explore dynamic federations of cloud infrastructure and services.

We invite contributions that address topics related to the use of autonomic computing approaches for managing cloud infrastructure, creating and managing federations of cloud infrastructure and services, and managing scientific applications hosted on a cloud infrastructure. Topics of interest include (but are not limited to):

• Auto-scaling strategies

• Adaptive deployment, configuration, and management approaches

• Use of feedback and adaptive control strategies for cloud management

• Adaptive applications development

• Quality of service management

• Adaptive data management and processing on clouds

• Intrusion estimation and detection systems

• Autonomic federation of cloud infrastructure and services

• Platforms for autonomic applications
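As a concrete illustration of the auto-scaling and feedback-control strategies listed above, a minimal threshold-based feedback controller might look like the following sketch. The `Autoscaler` class and its thresholds are invented for illustration; real autoscalers add cooldown periods, smoothing of the utilization signal, and provider-specific provisioning calls.

```python
class Autoscaler:
    """Feedback controller: grow or shrink a VM pool to track target load."""

    def __init__(self, min_vms: int = 1, max_vms: int = 10,
                 scale_up_at: float = 0.8, scale_down_at: float = 0.3):
        self.vms = min_vms
        self.min_vms, self.max_vms = min_vms, max_vms
        self.scale_up_at, self.scale_down_at = scale_up_at, scale_down_at

    def step(self, avg_utilization: float) -> int:
        # Hysteresis band: act only when utilization leaves [0.3, 0.8],
        # which avoids oscillating on small fluctuations.
        if avg_utilization > self.scale_up_at and self.vms < self.max_vms:
            self.vms += 1
        elif avg_utilization < self.scale_down_at and self.vms > self.min_vms:
            self.vms -= 1
        return self.vms


scaler = Autoscaler()
# Rising load adds capacity; idle periods release it, within bounds.
assert scaler.step(0.95) == 2
assert scaler.step(0.85) == 3
assert scaler.step(0.5) == 3   # inside the band: no change
assert scaler.step(0.1) == 2
```

The same closed loop (observe a metric, compare it to a setpoint, actuate the resource pool) is the basic autonomic pattern the call describes, whether the managed component is a scheduler, a power manager, or a federation of clouds.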

The guest editors invite original and high-quality research submissions addressing all aspects of this field, as long as the connection to the focus topic is clear and emphasized. Experience reports, surveys, critical evaluations of the state of the art, and insightful analysis of established and upcoming technologies are also welcome.

Special Issue Guest Editors: Manish Parashar, Rutgers University, USA

Javier Diaz-Montes, Rutgers University, USA

Omer Rana, Cardiff University, UK

Ioan Petri, Cardiff University, UK

Submission Information: Submissions should be 4,000 to 6,000 words long and should follow the magazine's guidelines on style and presentation. All submissions will be anonymously reviewed (single-blind) in accordance with normal practice for scientific publications. For more information, contact the guest editors at [email protected].

Authors should not assume that the audience will have specialized experience in a particular subfield. All accepted articles will be edited according to the IEEE Computer Society style guide. Submit your papers to Manuscript Central at https://mc.manuscriptcentral.com/ccm-cs.

Call for Papers

IEEE Cloud Computing Magazine

Special Issue on Autonomic Clouds

www.computer.org/cloudcomputing


Page 10: Cloud Computing


WELCOME TO THE THIRD ISSUE OF IEEE CLOUD COMPUTING, DEDICATED TO "SECURE CLOUD COMPUTING TECHNIQUES FOR BIG DATA." Bharat Bhargava, Ibrahim Khalil, and Ravi Sandhu are the guest editors for this special issue.

Cloud architectures are well suited to big data deployments. To date, the bulk of the focus on this topic has been on the development of infrastructures, analytics, and visualization. Although other concerns such as security and privacy have received less attention, they are rising in importance. Many commercially important big data applications need to share and process privacy-sensitive data. Increasing incidents of data misuse and data breach, distributed attacks aimed at privacy violations, and denial-of-service attacks also make it increasingly important to raise the security level and increase the focus on privacy protection in cloud settings for big data applications.

Given all this, cloud platforms need to be fortified with robust security and privacy mechanisms to deliver reliable services. Specifically, this special issue aims to address topics such as access control, encryption, collaborative threat detection using big data analytics, obfuscation, secure storage/retrieval for big data, watermarking of big data, and secure and efficient transmission of big data. These issues are the main focus of this effort to disseminate recent advances and stimulate future research directions in the specialized area of security and privacy within the context of big data applications in a cloud environment.

The columns in this issue address a diverse range of topics. In "Standards Now," Alan Sill presents a general overview of APIs, protocols, programming languages, and tools and how they relate to cloud standardization. In "Cloud and the Law," Raymond Choo looks at issues pertinent to mobile cloud storage users. Paul Watson of Newcastle University guest authors the "Blue Skies" department, in which he explores ways to achieve application security through hybrid and federated clouds. David Bernstein, in the "Cloud Tidbits" column, covers the role of containers in cloud architecture, from LXC to Docker to Kubernetes. Finally, "What's Trending?" with guest author Amandeep Khurana of Cloudera highlights issues for big data in public clouds.

In this issue, you will also see the first "Cloud and the Government," in which column editor Laura Taylor provides a good overview of the Federal Risk and Authorization Management Program (FedRAMP).

It is my pleasure to introduce two new columns to IEEE Cloud Computing's roster. Both will first appear in the magazine's next issue. The first column, "Cloud Economics," will be led by Joe Weinman (see Weinman's bio in the sidebar) and will cover cloud-economics-related topics such as value chain, revenue models, and pricing models.

The second column is the “Cloud Community Corner,” which will cover various topics important to the cloud community, such as recent results from

MAZIN YOUSIF, T-Systems International, [email protected]

A Focus on Security and Privacy in the Cloud

FROM THE EDITOR IN CHIEF


Page 11: Cloud Computing


cloud-themed conferences, book reviews, and interesting observations from thought leaders in the cloud community. The magazine's entire editorial board will contribute to this column, and we invite your submissions of items that can be brought to the attention of the community.

Stay tuned for our next issue, a special issue on "Cloud-Based Smart Evacuation Systems for Emergency Management," which will be available in late December.

MAZIN YOUSIF is the editor in chief of IEEE Cloud Computing. He's the chief technology officer and vice president of architecture for the Royal Dutch Shell Global account at T-Systems International. Yousif has a PhD in computer engineering from Pennsylvania State University.

INTRODUCING COLUMNIST JOE WEINMAN

Joe Weinman is the author of Cloudonomics: The Business Value of Cloud Computing (Wiley, 2012, English; PTPress, 2014, Chinese), which examines private and public cloud cost and performance optimization from a quantitative perspective. Weinman is also the author of the forthcoming book Digital Disciplines (Wiley CIO), which focuses on how IT can invigorate business strategy through better processes, products, customer relationships, and innovation.

Weinman is currently the chair of the IEEE Intercloud Testbed executive committee, an analyst for GigaOm Research, and serves on the advisory boards of several technology companies. Previously, he held executive positions at Bell Labs, AT&T, HP, and Telx. Among other accolades, he has been recognized as a "Top 10 Cloud Computing Leader."

Weinman has BS and MS degrees in computer science from Cornell University and the University of Wisconsin-Madison, respectively, and has completed executive education at the International Institute for Management Development in Lausanne. He has been awarded 21 patents in areas such as homomorphic encryption, pseudoternary line coding, adaptive bandwidth schemes, Web search, and distributed storage and computing.

We look forward to his contribution to IEEE Cloud Computing!

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


LAURA TAYLOR, Relevant Technologies, [email protected]

PUBLISHED BY THE IEEE COMPUTER SOCIETY, 2325-6095/14/$31.00 © 2014 IEEE

THE FEDERAL RISK AND AUTHORIZATION MANAGEMENT PROGRAM (FEDRAMP), DEVELOPED BY THE US GENERAL SERVICES ADMINISTRATION (GSA) IN CONJUNCTION WITH THE US OFFICE OF MANAGEMENT AND BUDGET (OMB) AND THE FEDERAL CIO COUNCIL, WAS LAUNCHED 6 JUNE 2012. Figure 1 provides a timeline of FedRAMP-related events starting from the announcement of the initial working group through the two-year launch anniversary. FedRAMP is the US government program to apply the Federal Information Security Management Act (FISMA) to cloud computing. Initially, skeptics warned that the program wouldn't gain acceptance and would become another government IT casualty. Yet FedRAMP has been so successful that many governments in East Asia, Northern Europe, and the Americas are using it as a model for their own cloud security programs.

Cloud computing promises lower cost and the ability to quickly scale resources up or down as workloads demand, leading organizations in both the public and private sectors to seriously consider moving their applications and data to the cloud. Concern about cloud security has been the number one obstacle to adoption, particularly in the public sector.

FedRAMP provides a comprehensive set of cloud security requirements and an independent assessment program backed by the chief information officers (CIOs) of the Department of Defense (DoD), the Department of Homeland Security (DHS), and the GSA. Cloud service providers (CSPs) that implement the required security controls and meet independent assessment requirements can be authorized for use by the federal government. There's no shortage of CSPs jockeying for what has become the most coveted and prestigious qualifier of cloud security. So far, more than 50 CSPs have either been authorized or are far enough into the process that the FedRAMP website lists them as "in process."

CLOUD AND THE GOVERNMENT

FedRAMP: History and Future Direction

Standardizing the Authorization Process

Since its 2002 launch, FISMA has required that all systems hosting US government data be authorized prior to being put into production. The authorization process is extremely comprehensive, and until FedRAMP came along, system owners had to go through the entire authorization process for each agency using their system, even if the system was exactly the same from one agency to another. FedRAMP standardized the process such that authorizations can be performed once and reused by multiple agencies. It saves both government and private sector CSPs a lot of time and money and enables fast adoption of new systems and services. According to the FedRAMP program management office (PMO), Amazon estimates that its FedRAMP authorization saves approximately $250,000 per assessment. The FedRAMP PMO estimates that assessments cost the US government approximately $250,000. With the launch of FedRAMP, CSPs now pay for the assessments instead of the US government. The authorized cloud systems cover at least 160 known FISMA implementations across the government, giving current FedRAMP cost savings a conservative estimate of $40 million.

When he became the first federal CIO, Vivek Kundra championed cloud adoption as a way for agencies to save resources and improve service. However, without a way to secure the cloud and enable FISMA authorizations, cloud adoption would not come easily. Kundra launched the Federal Cloud Computing Working Group under the Federal CIO Council, a group of government CIOs that meets regularly to discuss government IT initiatives. At one of the council meetings, GSA CIO Casey Coleman volunteered GSA to take the lead in addressing federal agencies' adoption of cloud computing. Coleman in turn appointed her chief of staff, Katie Lewin, to manage the effort. In April 2009, GSA established the Cloud Computing PMO, and Lewin became the Federal Cloud Computing Initiative Director.

In addition to FedRAMP, Lewin was charged with heading up the Federal Data Center Consolidation Initiative (FDCCI). According to Lewin, "The FDCCI project was really the camel's nose under the tent for launching government cloud and ultimately FedRAMP." Lewin ensured that the brain trust at the National Institute of Standards and Technology (NIST) was involved with FedRAMP from the start.

The Federal Cloud Computing Working Group was initially chaired by Peter Mell. Mell was part of the NIST Information Technology Laboratory in the Computer Security Division and became involved in the working group after writing the technical definition of cloud computing adopted by the government cloud program.1 In fall 2009, the group identified cloud authorization as the largest security hurdle to government cloud adoption. To address this, Mell conceived of the notion of government-wide authorization and worked out a formal process with his NIST colleagues (such as Ron Ross). He presented "A Notional Process on Security Assessment and Authorization for Cloud Computing Systems" to the working group and the cloud PMO. It was well received and, in early 2010, Lewin worked with Mell to present the idea to Kundra and then to the CIO council.

Forming a Cloud Policy

To pitch the idea, they needed a name. The acronym FedRAMP appeared on a paper plate next to Mell's sandwich one day as he listed descriptive words for government-wide authorization programs. The logo (similar to the one used today) was a result of an internal security working group competition. The idea was adopted and the admittedly slow process of creating the first government-wide authorization process began.

FIGURE 1. Federal cloud computing initiative and FedRAMP timeline. (Source: FedRAMP)
March 2009: Cloud Computing Program launched; Executive Steering Committee established
April 2009: Cloud Computing Program Management Office established
October 2009: Security working group established
February 2010: FedRAMP concept announced
June 2010: FedRAMP drafts initial baseline
July-Sept. 2010: FedRAMP concept vetted with industry and government
November 2010: FedRAMP concept, controls, and templates released
December 2010: Cloud Computing Program Management Office established
January 2011: More than 1,200 public comments received
Feb.-Mar. 2011: Government Tiger teams review comments
Apr.-June 2011: Executive team solidifies Tiger team recommendations
July-Sept. 2011: FedRAMP drafts initial baseline
December 2011: FedRAMP policy signed
February 2012: FedRAMP CONOPS published
May 2012: 3PAOs accredited
June 2012: FedRAMP launches initial operational capability
December 2012: JAB grants first provisional authorization
January 2013: JAB grants second provisional authorization

In December 2010, Kundra published the 25 Point Implementation Plan to Reform Federal Information Technology Management.2 The plan announced the cloud first policy, which stated, "When evaluating options for new IT deployments, OMB will require that agencies default to cloud-based solutions whenever a secure, reliable, cost-effective cloud option exists."2 In February 2011, Kundra released the Federal Cloud Computing Strategy, which stated that government agencies must focus on managing services rather than assets.3 In this paper, Kundra estimated that $20 billion of the then $80 billion in IT spending could be migrated to the cloud. Kundra forecast that by moving to the cloud, government agencies could improve server utilization by 60 to 70 percent and could increase responsiveness to urgent agency needs. The stage was set and government agencies would have to start migrating to the cloud, like it or not (see Figure 1 for a chronology of events).

In August 2011, former Microsoft executive Steven VanRoekel succeeded Kundra as federal CIO. VanRoekel established FedRAMP via the "Security Authorization of Information Systems in Cloud Computing Environments" memorandum issued on 8 December 2011 (see https://cio.gov/wp-content/uploads/2012/09/fedrampmemo.pdf), which provided a cost-effective, risk-based approach for the adoption and use of cloud services. Under Lewin's leadership, the FedRAMP PMO ramped up quickly on resources when an OMB examiner transferred resources from the GSA Federal Acquisition Services (FAS) office to Lewin. Lewin hired Matthew Goodrich as deputy program manager. Lewin retired from the government in 2013, and Goodrich is the current acting FedRAMP director.

Although FedRAMP has attracted Amazon, Microsoft, HP, IBM, AT&T, and other big players, the first CSP to be authorized was Autonomic Resources, a government-only CSP headquartered in Research Triangle Park, North Carolina. Autonomic Resources predicted early that FedRAMP and DoD authorizations would be a boon to business and built a government cloud specifically for FedRAMP authorization. According to James Bowman, Autonomic Resources' government compliance director, "The ARC-P IaaS government community cloud solution was designed and built from the ground up to meet the stringent FedRAMP and DoD security control requirements. Our value lies in our cost savings, custom-built cloud services for government, and our high level of security and compliance."

Table 1. FedRAMP preparation requirements. (Source: Guide to Understanding FedRAMP5)

1. You have the ability to process electronic discovery and litigation holds.
2. You have the ability to clearly define and describe your system boundaries.
3. You can identify customer responsibilities and what they must do to implement controls.
4. The system provides identification and two-factor authentication for network access to privileged accounts.
5. The system provides identification and two-factor authentication for network access to non-privileged accounts.
6. The system provides identification and two-factor authentication for local access to privileged accounts.
7. You can perform code analysis scans for code written in-house (non-COTS products).
8. You have boundary protections with logical and physical isolation of assets.
9. You have the ability to remediate high-risk issues within 30 days and medium-risk issues within 90 days.
10. You can provide an inventory and configuration build standards for all devices.
11. The system has safeguards to prevent unauthorized information transfer via shared resources.
12. Cryptographic safeguards preserve the confidentiality and integrity of data during transmission.
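Checklist items 4 through 6 call for two-factor authentication. As an illustration only (FedRAMP does not mandate any particular mechanism, and this sketch is not drawn from the guide), one widely deployed second factor is a time-based one-time password (TOTP, RFC 6238), which fits in a few lines of standard-library Python:

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over a 30-second time counter,
    dynamically truncated to a short numeric code."""
    counter = struct.pack(">Q", unix_time // step)   # 8-byte big-endian counter
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret "12345678901234567890" at T=59 seconds
# yields the 8-digit code 94287082.
assert totp(b"12345678901234567890", 59, digits=8) == "94287082"
```

Because the server and the user's token derive the same code from a shared secret and the current time, a stolen password alone no longer grants access.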


Achieving FedRAMP Authorization

FedRAMP doesn't certify or authorize products of any kind. Rather, it aims to verify public and private cloud systems' security through FISMA. All US government clouds, private and public, must comply with FedRAMP. A system must already be built for its security to be verified. FedRAMP doesn't care what products you use to build your cloud, as long as the system is secure and meets the FedRAMP security control baseline. The Joint Authorization Board (JAB), which consists of the DoD, DHS, and GSA, selected the controls from NIST SP 800-53, Security and Privacy Controls for Federal Information Systems and Organizations.4

The Guide to Understanding FedRAMP, v2.0, 6 June 2014, includes a preparation checklist (see Table 1).5 If a CSP system can't at minimum meet these requirements, it isn't a suitable candidate for FedRAMP.

Before FedRAMP, the authorization process inherently had many redundancies that duplicated authorization work from one agency to another. One agency didn't necessarily trust another agency's authorization process because it used different controls and security templates, and the independent assessment process differed from agency to agency. Even if an agency had authorized a cloud platform, each time a new agency wanted to use that platform, the CSP had to go through the authorization process all over again, as Figure 2a illustrates.

With the advent of FedRAMP, agencies now use the same security control baseline, the same security templates, and the same independent assessment process, as illustrated in Figure 2b. The new process ensures consistency across all government agencies and instills a reciprocity of trust between agencies. Once a CSP has been authorized, any agency can leverage that authorization without repeating the process.

This new approach speeds up an agency's ability to roll out cloud services while reducing the cost of the authorization. The Department of Health and Human Services (HHS), the Department of Transportation (DOT), the Department of Agriculture (USDA), and the Department of Housing and Urban Development are at the forefront of cloud adoptions.

CSPs can use three different avenues to become authorized under FedRAMP. They can be authorized by the JAB or by an agency directly, or a CSP can self-submit a security package as a candidate for authorization. The FedRAMP website lists the three security package types. As of this writing, no CSP self-submitted packages are listed, although multiple CSPs are currently putting together packages in that category.

FIGURE 2. Authorization process for federal clouds: (a) old way and (b) new way. (Source: FedRAMP)

Once an agency decides to authorize a candidate package, the package moves to the "agency authorization" category on the FedRAMP website. The primary difference between an agency-authorized package and a JAB-authorized package is the level of review it undergoes. Agency-authorized packages are reviewed by one agency, whereas JAB-authorized packages are reviewed by the DHS, DoD, and GSA CIOs and their technical teams. The JAB's technical review teams consist of up to a dozen people from DoD, DHS, and GSA, all looking at the Security Assessment Report from different angles. Because of the number of people that review packages slated for JAB authorization, it can take considerably longer to get through the FedRAMP process if going through the JAB. Once a security package is listed in the FedRAMP repository, federal agencies can review it to determine if they want to use the system described in the package.5 Figure 3 summarizes the three FedRAMP security package types described above.

CSPs should not presume that their work is done after their system has been authorized. Continuous monitoring is required. According to Goodrich, "What we've seen at FedRAMP is that the hard part of security is people and processes, not the technology. The alignment of business processes like configuration management and patch management with vulnerability scanning is critical to a successful implementation of security on all systems." Authorized CSPs must perform monthly scans and send the scan results to their government authorization point of contact. High vulnerabilities must be mitigated within 30 days and moderate vulnerabilities within 90 days. Failure to mitigate vulnerabilities according to these requirements could lead to a CSP having its authorization suspended or revoked. FedRAMP's Continuous Monitoring Strategy Guide is available on the FedRAMP website.6
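The 30- and 90-day windows above are concrete enough to automate. A minimal sketch of deadline tracking (the finding-tuple format is invented for illustration, not a FedRAMP artifact):

```python
from datetime import date, timedelta

# Remediation windows from the FedRAMP continuous-monitoring requirements:
# high vulnerabilities within 30 days, moderate within 90.
WINDOWS = {"high": timedelta(days=30), "moderate": timedelta(days=90)}

def deadline(severity: str, found_on: date) -> date:
    """Date by which a finding of the given severity must be mitigated."""
    return found_on + WINDOWS[severity.lower()]

def overdue(findings, today):
    """IDs of findings past their deadline; each is (id, severity, date found)."""
    return [fid for fid, sev, found in findings
            if deadline(sev, found) < today]

scan = [("F-1", "high", date(2014, 1, 1)),
        ("F-2", "moderate", date(2014, 1, 1))]
# On 15 Feb 2014, only the high finding (deadline 31 Jan) is overdue;
# the moderate finding's deadline is 1 Apr.
assert overdue(scan, date(2014, 2, 15)) == ["F-1"]
```

Feeding monthly scan exports through a check like this is one simple way to catch a finding before it puts an authorization at risk.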

FEDRAMP WILL CONTINUE TO EVOLVE ITS PROGRAM AND PROCESSES OVER TIME. Check in at www.fedramp.gov for the latest updates.

References

1. P. Mell and T. Grance, The NIST Definition of Cloud Computing, NIST Special Publication 800-145, Nat'l Inst. of Standards and Technology, Sept. 2011; http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf.

2. V. Kundra, 25 Point Implementation Plan to Reform Federal Information Technology Management, The White House, 9 Dec. 2010; https://www.dhs.gov/sites/default/files/publications/digital-strategy/25-point-implementation-plan-to-reform-federal-it.pdf.

3. V. Kundra, Federal Cloud Computing Strategy, The White House, 8 Feb. 2011; https://www.dhs.gov/sites/default/files/publications/digital-strategy/federal-cloud-computing-strategy.pdf.

4. Joint Task Force Transformation Initiative, Security and Privacy Controls for Federal Information Systems and Organizations, NIST Special Publication 800-53, rev. 4, Nat'l Inst. of Standards and Technology, Apr. 2013; http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r4.pdf.

5. Guide to Understanding FedRAMP, v2.0, Federal Risk and Authorization Management Program (FedRAMP), 6 June 2014; http://cloud.cio.gov/document/guide-understanding-fedramp.

6. Continuous Monitoring Strategy Guide, v2.0, Federal Risk and Authorization Management Program (FedRAMP), 6 June 2014; http://cloud.cio.gov/document/continuous-monitoring-strategy-guide.

LAURA TAYLOR is the founder of Relevant Technologies and the chair of the FISMA Center's Advisory Board. She specializes in security compliance and security audits of government agencies and financial institutions. Taylor has provided information security consulting services to some of the largest financial institutions in the world, including the IRS, the US Treasury, the US government-wide accounting system, and various regional banks. She has also served as director of security research at TEC, chief information officer of Schafer Corporation, director of information security at Navisite, director of certification and accreditation at COACT, and director of security compliance at USfalcon.

FIGURE 3. Summary of FedRAMP security package types, in increasing level of review:
- CSP: CSP supplied, not yet reviewed (candidate for authorization)
- Agency: Reviewed and authorized by an agency
- JAB: Reviewed by the FedRAMP ISSO and JAB, authorized by the JAB



ALAN SILL, Texas Tech University, [email protected]

STANDARDS NOW

IN MODERN DEVELOPMENT ENVIRONMENTS, INNOVATION MUST BE BUILT IN EXPLICITLY AND NOT EXTERNALLY IMPOSED. Successful cloud computing methods are characterized by their intrinsic utility, capacity for highly scaled implementation, and ability to adapt to rapid change. These methods can be classified for practical purposes in terms of APIs, protocols, languages, and tools.

Classic approaches to producing standards are as archaic when applied to cloud computing as last century's lighting, transportation, and communication systems are to the rest of our society's infrastructure. For standards to work and be suitable in this new setting, we need an approach that promotes rapid feedback and simultaneous or near-simultaneous development and implementation.

In earlier columns, I've explored the role of communities in developing cloud standards and laid out the landscape of the organizations operating in this space. I've also argued that standards are part of a continuous spectrum of development that ranges from purely practical to purely theoretical end points, and that a standard can be defined lightly as "anything agreed to by more than one party." More formal definitions are certainly possible, and I've also discussed the various types of standards organizations and the importance of defining our terminology precisely to understand this spectrum of development.

This time, I'll compare and contrast the different types of cloud software components, and discuss the pros and cons of taking a combined development plus operations ("DevOps") approach to accelerate progress on software and standards. I'll focus on practical ways in which standards fit into familiar categories used by programmers on a day-to-day basis, and on how rapid feedback can improve them for use in these settings.

Cloud Development Categories

For convenience, we can organize the components of cloud software and associated methods into broad categories. I present one such classification here. It isn't the only possible scheme, and might not satisfy architecture purists, but I've simplified the discussion to focus on current cloud computing trends and needs. The point of this classification is to expose features that relate directly to current innovation opportunities and to discuss the consequent need for standards development methods that can keep pace with rapid software progress.

Cloud Standards and the Spectrum of Development

Application Programming Interfaces

APIs have emerged as a key feature of the new cloud ecosystem. They've become so popular that they're sometimes the only components of cloud software design that beginning programmers encounter, and such beginners can be forgiven for thinking that these are the only components of cloud software that matter. APIs dominate current discussion to the point that they've developed their own conferences, trends, and place in the economic landscape. Nonetheless, they aren't the entire picture and can't stand on their own. We'll have to delve deeper to understand their history and relation to other important cloud functionality and features.

APIs represent boundary-level conditions needed to transfer information into and out of cloud software environments. Classically, these environments were executable applications that were incapable of exposing their internal processes or parameters to the outside world for alteration or external consumption, hence the need to pass input, output, and control features through a defined interface. The other major category of boundary-level interfaces used in computing in general is often referred to as application binary interfaces (ABIs). I'll reserve discussion of ABIs for a later column.

In general, an interface defines features such as syntax, semantics, and optional versus required components of the information to be presented. In general use, APIs and ABIs also describe characteristics of the programming call sequences, such as classes, details of the methods to be used, and objects or information to be exchanged.

As the popularity of cloud computing grew, the API paradigm became so useful that almost all cloud software developed APIs, even if they weren't interfaces to "applications" in the classical sense. As this occurred, an evolution took place in the design of APIs and their use. Historically, APIs were often specific to the actual programming languages used and weren't generally interchangeable between different language calls to the methods. In cloud computing, the dominance of Web-based models and of their formal service-oriented architecture underpinnings allowed cloud APIs to be used across several different language implementations.

APIs based on the Representational State Transfer (REST) design pattern now dominate designs currently employed in new cloud software. It's worth noting, however, that earlier progress in decoupling APIs from dependence on specific language call interfaces and methods was driven by the previously dominant method in service-oriented architecture design: the pattern introduced in the late 1990s as the Simple Object Access Protocol (SOAP).

Web services based on SOAP and other closely related methods used XML to define interfaces so formally that code could actually be generated entirely from a compact description of the interface, the Web Services Description Language (WSDL), without reference to any other knowledge of the interface's characteristics.

This important development played a crucial role in getting programmers to think of APIs as potentially language-independent constructs that could be useful by themselves. Despite the obvious value of the language independence afforded by WSDL and SOAP, programmers eventually rebelled against the XML-only basis and prescriptiveness of these methods. Although they're still in use in a variety of Web services and enjoy a strong following for certain types of programming, many of the new features of cloud methods have transitioned to the REST paradigm.

This style change has been driven partly by the desire to be able to refactor services in different ways to span smaller or larger portions of the problems to be solved, and partly by the need for control to define the boundaries of the portion of the system exposed through an API. One of the defining characteristics of cloud computing is the flexibility to draw this control boundary in ways that sometimes cross the conventional norms of service-oriented architecture.

Modern API design for cloud computing often uses design principles in ways that are beginning to resemble formal guidelines that lead in the direction of standards, or even to be expressed formally as standards. Associated tools are emerging to allow APIs to express discoverability of functional features and to build in self-description of their characteristics and methods of use.

Examples of API format description and definition tools that include open formal specifications in addition to related implementation software include Swagger (http://swagger.io), API Blueprint (http://apiblueprint.org), and the RESTful API Markup Language (RAML; http://raml.org). These approaches can be used to encourage or enforce API self-documentation and structure. In addition, a wide variety of related open source and commercial software has emerged to provide manageability, analytics, and other features, sometimes provided externally by third parties as filters or add-ons to existing APIs.
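To make the idea of a machine-readable API description concrete, here is a toy sketch, loosely inspired by the Swagger style but not conforming to any of these formats: a description expressed as plain data, plus a generic validator that any client or gateway could apply. The service, paths, and field names are invented.

```python
from typing import Optional

# A toy, Swagger-inspired description of a hypothetical compute API.
SPEC = {
    "basePath": "/v1",
    "paths": {
        "/instances": {"get": {}, "post": {"required": ["image", "size"]}},
        "/instances/{id}": {"get": {}, "delete": {}},
    },
}

def _matches(template: str, path: str) -> bool:
    """True if a concrete path matches a template such as /instances/{id}."""
    t = template.strip("/").split("/")
    p = path.strip("/").split("/")
    return len(t) == len(p) and all(
        seg.startswith("{") or seg == actual for seg, actual in zip(t, p))

def validate(method: str, path: str, body: Optional[dict] = None) -> bool:
    """Check a request against SPEC: base path, method, required body fields."""
    if not path.startswith(SPEC["basePath"]):
        return False
    rel = path[len(SPEC["basePath"]):]
    for template, ops in SPEC["paths"].items():
        if _matches(template, rel) and method.lower() in ops:
            required = ops[method.lower()].get("required", [])
            return all(field in (body or {}) for field in required)
    return False

assert validate("GET", "/v1/instances")
assert validate("DELETE", "/v1/instances/i-42")
assert not validate("POST", "/v1/instances", {"image": "ubuntu"})  # "size" missing
```

Because the description is data rather than code, the same SPEC could drive documentation, client generation, or gateway enforcement, which is exactly the self-description role the tools above play.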

Protocols

Protocols are another important component of cloud methods. No matter how well described, the interface (API or ABI) can't by itself exercise all of the functionality needed to control and interact with running processes. A complete approach to online communication also needs protocols to define and describe the sequence of operations, the format and sequence of bits "on the wire," and characteristics such as timing, content, or other design principles that govern the information to be passed through it.

Protocols can be distinguished from APIs by the degree to which they specify interrelationships between different aspects of the information to be presented, and often the time sequence, content, and/or ordering of data and operations. Protocols also cover address and data formats, and the mappings needed to interrelate them. They can express subtle characteristics of sequence and flow in ways that are difficult to express purely within the context of an API. TCP/IP, which governs most operations on the Internet, is a good basic example.
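The distinction fits in a few lines of code: the socket calls below are the boundary interface, while the wire rules layered on top, one CRLF-terminated command answered by one CRLF-terminated reply, form a deliberately invented miniature protocol. TCP/IP plays the same role, at far greater depth, for the Internet at large.

```python
import socket
import threading

# The socket API only moves bytes. The "protocol" is the agreed wire
# behavior: one CRLF-terminated command, then one CRLF-terminated reply.
def read_line(conn: socket.socket) -> bytes:
    buf = b""
    while not buf.endswith(b"\r\n"):      # framing rule: read up to CRLF
        buf += conn.recv(64)
    return buf

def serve_one(conn: socket.socket) -> None:
    request = read_line(conn)
    conn.sendall(b"PONG\r\n" if request == b"PING\r\n" else b"ERR\r\n")

def ask(conn: socket.socket, command: bytes) -> bytes:
    conn.sendall(command + b"\r\n")       # ordering rule: send, then await reply
    return read_line(conn).rstrip(b"\r\n")

client, server = socket.socketpair()
worker = threading.Thread(target=serve_one, args=(server,))
worker.start()
reply = ask(client, b"PING")
worker.join()
assert reply == b"PONG"
```

Nothing in the socket API itself says where a message ends or who speaks next; those rules live entirely in the protocol, which is why protocols reward standardization.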

Unlike APIs, protocols were originally designed to be as independent of language implementations as possible. Because APIs have recently become increasingly language independent, protocols and APIs are now often designed in tandem, and many current cloud standards are written with both protocol and API components. The Open Cloud Computing Interface (OCCI), for example, defines a boundary-level API and protocol for RESTful control of cloud computing components existing within the boundary of the system to be controlled. Other standards sometimes concentrate on one or the other of these aspects, or on specific details of control and communications.

REST-based standards and models that use HTTP as their transport protocol can be further distinguished from each other by their use of hypermedia, which is an essential feature of modern cloud API usage that depends on the detailed nature of HTTP.

Protocols are generally used in organized versions, so they are best developed as standards. This aspect of cloud development is easy to miss, because it's essentially taken for granted that good protocols will be used to organize the communications handled by our APIs. Organizations that develop protocols include all of the major standardization bodies, such as the World Wide Web Consortium (W3C) and Internet Engineering Task Force (IETF), and essentially all of the organizations covered in previous columns.

Languages, Tools, and the Overall Development Environment

Cloud implementations are written in a variety of programming languages with methods that are supplemented by an even wider variety of tools. It's easy to ignore that these languages and tools are themselves often organized and defined by standards and divided into releases and versions. Use of languages and tools follows a pattern with wide variation in terms of size and type of supporting organization, and solutions with a single person underlying the approach aren't unusual. There is only space in this column to touch lightly on this topic.

The concepts of interoperability and scalability have made the distinction between different types of languages and tools largely irrelevant by design as a deliberately targeted feature of cloud solution deployment. It's taken for granted that a successful cloud infrastructure won't depend unduly on features of the programming methods used to create and implement it. This aspect is almost a design requirement for modern cloud development.

Formally standardized external analysis methods can also be applied usefully in cloud computing. One example is the TLA+ language, specification, and tools developed by Leslie Lamport (see http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html). Using mathematical set theory and predicates, the TLA+ approach describes the legal behaviors of a system. Amazon has recently used TLA+ along with similar methods to find and eliminate problems due to time sequencing, dependencies, and design flaws within its software and infrastructure.1
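TLA+ has its own notation and tooling, but the underlying idea, exhaustively checking that every reachable state of a system satisfies a stated invariant, can be sketched in a few lines of ordinary code. The toy two-node system below is invented purely for illustration and is not a TLA+ specification:

```python
# Toy state-space exploration in the spirit of TLA+ model checking:
# enumerate every reachable state of a tiny two-node counter system
# and verify a "legal behaviors" invariant over all of them.
def step(state):
    """Next states: whichever node holds the lock may increment, then hands off."""
    a, b, lock = state
    if lock == "a":
        return [(a + 1, b, "b")]
    return [(a, b + 1, "a")]

def reachable(initial, limit=20):
    """Breadth-first enumeration of states, bounded so the search terminates."""
    seen, frontier = {initial}, [initial]
    while frontier:
        nxt = [s for st in frontier for s in step(st)
               if s not in seen and s[0] + s[1] <= limit]
        seen.update(nxt)
        frontier = nxt
    return seen

states = reachable((0, 0, "a"))
# Invariant: the two counters never drift apart by more than one.
assert all(abs(a - b) <= 1 for a, b, _ in states)
print(f"checked {len(states)} states; invariant holds")
```

Real model checkers apply the same loop to vastly larger state spaces, which is how subtle sequencing and dependency flaws of the kind Amazon reported are surfaced before deployment.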

Because of the importance of networking to the successful deployment of cloud-based systems, a wide variety of work is ongoing to develop new standards and languages that can be used to express the features of networks in cloud settings. I will explore this topic further in a future column.

The environment in which cloud methods are developed and used is as important to their success as the interface, protocols, languages, and tools used to implement them. Because the cloud environment emphasizes advantages in scalability, on-demand deployment, flexibility in interoperation, and bridging between multiple levels of information, people working in this area prefer tools with the same flexible characteristics.

Open project and code repositories and software distributions laid the groundwork for the cloud environment, and lately this approach has been extended to include methods for public sharing of libraries of entire virtual machine and prebuilt custom container images. These are clearly not the only methods of software development and distribution, but they've gained importance over the last several years.

A DevOps Approach for Standards

Cloud computing emerged specifically to deliver methods for providing services that are easy to factorize and deploy, and can be implemented rapidly at greatly variable scales. Such a setting requires procedures that allow quick cycling and continuous integration between the development and operational deployment of cloud services. The industry has therefore adopted the widely popular DevOps strategy, which combines aspects of development and operations to speed implementation and testing of new solutions and application of new methods.

Although earlier computing models could have used this approach, cloud computing has several characteristics, such as ease of simultaneous side-by-side comparisons of performance and factorization of services, that make the DevOps approach particularly attractive.

This approach can apply equally well to standards. Identifying methods to feed input from real-world experience back to standards developing bodies is an important and necessary step toward improvement in any area, and is especially needed in cloud computing. The quicker the feedback, the quicker we can expect progress.

Unfortunately, earlier standards development practices were based on slower and more formal communication patterns that don't lend themselves to today's rapid progress and rapid cycling between conceiving new ideas and testing them in the field. To alleviate this shortcoming, we need to take a DevOps approach to bridge the gaps more quickly between formal ideas and practical implementation. In doing so, we also need to scale the communication patterns horizontally to involve more opinions and feedback for the betterment of the field.

One reason that OpenStack (http://specs.openstack.org) is making such progress now, for example, is that it has exposed its specification-writing process to community input and formalized the process of pulling resulting improvements into the project's core development, selection, and verification procedures. Other similar software projects, such as CloudStack (https://cwiki.apache.org/confluence/display/cloudstack/design) and OpenNebula (http://community.opennebula.org/interoperability), are also providing such information. This approach will be strongest if mutual engagement occurs between standardization communities and software developers in each project.

Engagement of this type is beginning to happen, and open source implementations of OCCI, Topology and Orchestration Specification for Cloud Applications (TOSCA), Cloud Data Management Interface (CDMI), Cloud Infrastructure Management Interface (CIMI), and other emerging cloud standards are now available in each of the above software efforts, as well as in general-purpose software libraries suitable for use in other settings. A quick search in the GitHub repositories will yield several relevant projects.

Where closed-source development still occurs, it also needs to be pursued in a way that encourages rapid cycling between ideas and implementations for standards to be effective in these settings. Some degree of interoperability can be extended to otherwise nonstandardized commercial products through associated open source projects. The Eutester project (https://github.com/eucalyptus/eutester), for example, can be used to automate tests of a Eucalyptus or Amazon cloud. Although no formal open consensus-based standards exist to provide the community underpinnings for Amazon-compatible products, projects such as Eutester can partially fill the gap between product features and their user communities.

Consensus Versus Speed, and Rapid Testing as a Cure

The downside of taking a rapid-cycling approach aimed only at functionality is that it can place a great deal of pressure on the methods commonly used to develop consensus within open standards communities. Standards work best if they can be used to bridge the differences between projects to provide the basis for interoperation. Developing tools that can be adopted effectively in multiple software projects and in commercial products taxes our collective ability to coordinate and test new features in different settings and build the consensus needed to create effective open standards.

Such consensus is one of the five core principles recently enumerated by the OpenStand effort. Several major standards developing bodies, including the Internet Society, the IETF, the Internet Architecture Board, the W3C, IEEE, and the Open Grid Forum, have endorsed its joint statement of affirmation (see http://open-stand.org/about-us/affirmation).

Despite this strong endorsement, other standardization organizations have departed from or haven't yet endorsed the OpenStand principles. Among these is the Web Hypertext Application Technology Working Group (WHATWG), an organization that formed a decade ago to pursue evolution of hypertext-related specifications and explicitly includes a non-consensus-based membership steering group partly justified by the professed need for speed in development. Consensus unfortunately takes time and can produce slow and variable results.

One way to mitigate these problems is to encourage rapid testing against major implementations, which can sometimes squeeze out opinions not backed by large-scale organizational participants. The divergence between WHATWG and W3C specifications for HTML is an example of the potential pitfalls in this area. Cloud computing needs processes to create open, active communication between development of software and standards without encountering such difficulties.

FUTURE INSTANCES OF THIS COLUMN WILL LOOK AT INDIVIDUAL STANDARDS IN TERMS OF CONCEPTUAL FUNCTIONS THEY CAN BE USED TO PERFORM, SUCH AS IMAGE PORTABILITY, JOB PROVISIONING, AND TASK ORCHESTRATION. Meanwhile, the information presented in this column should help illustrate the use of standards in necessary components of day-to-day software development.

Please respond with your opinions on this or previous columns, especially if you disagree with me, and include any news you think the community should know. You can reach me at [email protected].

References

1. C. Newcombe et al., "Use of Formal Methods at Amazon Web Services," online publication, 29 Sept. 2014; http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf.

ALAN SILL directs the US National Science Foundation Center for Cloud and Autonomic Computing at Texas Tech University, where he is also a senior scientist at the High Performance Computing Center and adjunct professor of physics. He serves as vice president of standards for the Open Grid Forum and co-chairs the US National Institute of Standards and Technology's "Standards Acceleration to Jumpstart Adoption of Cloud Computing" working group. Sill holds a PhD in particle physics from American University. He's an active member of IEEE, the Distributed Management Task Force, TM Forum, and other cloud standards working groups, and has served either directly or as liaison for the Open Grid Forum on several national and international standards roadmap committees. For further details, visit http://cac.ttu.edu or contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


MOBILE DEVICES (SUCH AS ANDROID, iOS, WINDOWS, AND BLACKBERRY DEVICES) AND MOBILE APPS ARE RAPIDLY BECOMING PART OF EVERYDAY LIFE FOR INDIVIDUAL AND ORGANIZATIONAL USERS IN BOTH DEVELOPED AND DEVELOPING COUNTRIES.

One popular app category is apps that provide cloud-based storage services compatible with a range of devices, including PCs, laptops, and mobile devices. For example, Netskope reports that cloud storage apps such as Google Drive, Amazon CloudDrive, OneDrive, and iCloud were among the top 20 most popular cloud apps during the first half of 2014.1 Dropbox, another popular cloud storage app, had more than 100 million downloads on the Google Play store at the time of writing.

As with most popular consumer technologies, criminals can exploit vulnerabilities in mobile devices and operating systems or mobile apps to target mobile device and app users. Because of their capability to store vast amounts of user data, cloud storage apps are a potential and attractive target for criminals.

Threats to Mobile Device and Cloud Storage App Users

Gartner predicts that

[T]hrough 2017, 75% of mobile security breaches will be the result of mobile application misconfigurations. By 2017, the focus of mobile breaches will shift to tablets and smartphones from workstations. Through 2015, more than 75% of mobile applications will fail basic security tests.2

In May 2014, for example, a significant number of Australian iOS devices were reportedly hijacked and locked for ransom. Subsequent analysis determined that affected users' iCloud accounts had been compromised.3 According to various media articles, affected users who didn't set a passcode prior to the hack had to reset their devices to factory settings, resulting in the erasure of all user data stored on the affected devices.

Mazin Yousif, editor in chief of this magazine, also questioned whether the recent incident in which iTunes customers in 119 countries received U2's "Songs of Innocence" without their consent4 suggests that criminals could potentially target iOS mobile device management (MDM). In principle, it isn't impossible that iOS MDM servers could be compromised, say by a malicious insider, to push malicious or potentially unwanted applications to iOS devices managed by the affected servers. For example, in recent work, Samuel O'Malley and I presented a method that a corrupt insider could use to facilitate (inaudible) data exfiltration from an air-gapped system without using any modified hardware.5 Such techniques could easily be used to exfiltrate data from cloud servers.

Christoph Stach and Bernhard Mitschang highlighted the implications of poor privacy management approaches.6 They also pointed out that a vast majority of current mobile apps request access to highly sensitive data and personally identifiable information (PII), such as geographical location and contact data.

CLOUD AND THE LAW

Mobile Cloud Storage Users

KIM-KWANG RAYMOND CHOO, University of South Australia, [email protected]

IEEE Cloud Computing, published by the IEEE Computer Society. 2325-6095/14/$31.00 © 2014 IEEE

FIGURE 1. Routine activity theory. RAT proposes that crime occurs when a suitable target is in the presence of a motivated offender and is without a capable guardian. (The figure shows three overlapping factors, Opportunity, Motivation, and Guardian, with Crime at their intersection.)

In other work, Christian D'Orazio and I proposed a generic process for identifying vulnerabilities and design weaknesses in iOS apps. Using this process, we revealed a previously unknown/unpublished vulnerability in a widely used Australian Government healthcare app that consequently exposes the user's sensitive data and PII stored on the device.7

This is, perhaps, not surprising because many mobile apps weren't designed with user security and privacy in mind, owing to the rush to attract new consumers and accelerate the product's time to market. Such a situation is somewhat similar to two or three decades ago when published cryptographic protocols were subsequently found to be insecure.8

Suffice it to note that threats to mobile device and cloud storage app users are real and increasingly important because of the increasing amount of sensitive user data and PII stored on and transmitted from mobile devices and cloud storage and other apps (for example, using browsers and apps to upload and download corporate and personal data from mobile devices to cloud storage servers).

Routine Activity Theory Approach

The routine activity theory (RAT), often used to explain criminal events, proposes that crime occurs when a suitable target is in the presence of a motivated offender and is without a capable guardian.9

Offender motivation is a crucial element of RAT, which assumes that offenders are rational and appropriately resourced actors operating in the context of high-value and poorly protected targets.10 The interaction between potential victims (in our context, mobile device and cloud storage app users), offenders, and situational conditions (for example, opportunities such as devices connecting to free Wi-Fi, and weak guardianship such as poor security hygiene) influences the risk and impact of victimization.

I don’t think many of us want to wake up tomorrow and discover that the data we stored in the cloud was leaked and photos we assumed were private are no longer so. In September 2014, for example, a number of celebrities’ iCloud accounts were reportedly com-promised, resulting in the theft of (in-timate) photos from these compromised accounts.11–13 Apple subsequently con-firmed the incident14:

After more than 40 hours of investigation, we have discovered that certain celebrity accounts were compromised by a very targeted attack on user names, passwords and security questions, a practice that has become all too common on the Internet.

Individual mobile cloud users must therefore be vigilant and take measures to protect the data stored on their mobile devices and in the cloud. Such measures should target one or more of the following areas (see Table 1):

• Reducing opportunity (for example, increasing the effort required to offend);

• Enhancing guardianship (for example, increasing the risk of getting caught); and

• Reducing motivation (for example, reducing the rewards of offending).
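For a sense of how a guardianship-enhancing measure from Table 1 works in practice, the two-step verification codes offered by cloud services are typically time-based one-time passwords (TOTP, RFC 6238). A minimal sketch, using the RFC's published test secret rather than a real account credential:

```python
import hashlib
import hmac
import struct

# Minimal sketch of RFC 6238 time-based one-time passwords, the
# mechanism behind two-step verification codes. A real service
# provisions a per-account secret; this one is the RFC test secret.
def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over the counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                              # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, at: float, step: int = 30) -> str:
    """RFC 6238 TOTP: HOTP over the current 30-second time window."""
    return hotp(secret, int(at // step))

secret = b"12345678901234567890"                         # RFC test secret
print(totp(secret, at=59))                               # -> "287082" (RFC test vector)
```

Because the code changes every 30 seconds and depends on a secret the attacker doesn't hold, stolen passwords and guessed security questions alone, as in the attack Apple described, are no longer sufficient to take over the account.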

In summary, security measures shouldn't lag behind new technology trends. Fortunately, the private sector has enormous incentives for contributing to mobile device/app and cloud security. Now is certainly a good time to get into the business of mobile device/app and cloud security.

WE WELCOME YOUR CONTRIBUTIONS AND ENCOURAGE YOU TO BE PART OF THE MOBILE AND CLOUD SECURITY LANDSCAPE.16

The risk is not just to the mobile device and cloud storage app users, but also to the organizations they work for.


From a legal perspective, for example, what are the implications of user data and PII leakage from mobile devices? Should a cloud service provider be responsible for pure economic loss to cloud service users due to its negligent acts? Other areas of interest include the potential surveillance risks faced by mobile cloud storage users, particularly in the aftermath of the revelations by Edward Snowden that the National Security Agency has been conducting wide-scale government surveillance, including surveillance targeting mobile device and cloud users. Therefore, another key question that needs to be examined is, "How do we balance the need for a secure cloud computing ecosystem and the rights of individuals to privacy against the need to protect society from serious and organized crimes, terrorism, and threats to cyber and national security interests?"

References

1. Netskope, Netskope Cloud Report, report 7/14 RS-33-1, 2014; www.netskope.com/wp-content/uploads/2014/07/NS-Cloud-Report-Jul14-RS-00.pdf.

2. Gartner, "Gartner Says Worldwide PC, Tablet and Mobile Phone Combined Shipments to Reach 2.4 Billion Units in 2013," press release, 4 April 2013; www.gartner.com/newsroom/id/2408515.

3. AppleInsider staff, "Hackers Use 'Find My iPhone' to Lockout, Ransom Mac and iOS Device Owners in Australia," AppleInsider, 26 May 2014; http://appleinsider.com/articles/14/05/27/hackers-break-into-lock-macs-and-ios-devices-for-ransom-in-australia.

4. M. Williams, "Half a Billion iTunes Customers Receive Latest U2 Album for Free," The Guardian, 10 Sept. 2014; www.theguardian.com/music/2014/sep/09/u2-songs-of-innocence-itunes-customers-free-album.

5. S. O'Malley and K.-K.R. Choo, "Bridging the Air Gap: Inaudible Data Exfiltration by Insiders," Proc. 20th Americas Conf. Information Systems (AMCIS 14), 2014; http://aisel.aisnet.org/amcis2014/ISSecurity/GeneralPresentations/12.

6. C. Stach and B. Mitschang, "Privacy Management for Mobile Platforms—A Review of Concepts and Approaches," Proc. 14th IEEE Int'l Conf. Mobile Data Management (MDM 13), 2013, pp. 305–313.

7. C. D'Orazio and K.-K.R. Choo, "A Generic Process to Identify Vulnerabilities and Design Weaknesses in iOS Healthcare Apps," Proc. 48th Ann. Hawaii Int'l Conf. System Sciences (HICSS 15), to be published in 2015.

8. K.-K.R. Choo, Secure Key Establishment, Springer, 2009.

Table 1. Suggested areas for improving data protection in mobile devices and cloud storage apps.

| Security measure | Reduce opportunity | Enhance guardianship | Reduce motivation |
|---|---|---|---|
| Target hardening, such as prompt installation of software and hardware patches and antivirus software | Yes | Yes | No |
| Report lost or stolen devices and cybervictimization to appropriate authorities | No | Yes | No |
| Delete data stored on the mobile device before disposing of the device and deactivating the account | Yes | No | Yes |
| Delete data from cloud accounts before deactivating the account or before the contract expires for corporate cloud users (note that data anonymization and data deletion are not the same). One could also encrypt the data stored in the cloud, then delete the encryption key and the encrypted data from the account before deactivating the account or before the contract expires | Yes | No | Yes |
| Avoid visiting websites of dubious repute or downloading unknown apps from third-party app stores | Yes | No | No |
| Use device encryption and alphanumeric, nonguessable passwords for cloud and other accounts | Yes | Yes | Yes |
| Use a two-step verification feature offered by cloud services such as Apple15 | Yes | Yes | No |

9. L.E. Cohen and M. Felson, "Social Change and Crime Rate Trends: A Routine Activity Approach," Am. Sociological Rev., vol. 44, no. 4, 1979, pp. 588–608.

10. M. Felson, Crime and Everyday Life, Pine Forge Press, 1998.

11. L. Kelion, “Apple Toughens iCloud Security after Celebrity Breach,” BBC News, 17 Sept. 2014; www.bbc.com/news/technology-29237469.

12. D. Lewis, “iCloud Data Breach: Hacking and Celebrity Photos,” Forbes, 2 Sept. 2014; www.forbes.com/sites/davelewis/2014/09/02/icloud-data-breach-hacking-and-nude-celebrity-photos.

13. D. Wakabayashi and D. Yadron, “Apple Denies iCloud Breach,” Wall Street J., 2 Sept. 2014; http://online.wsj.com/articles/apple-celebrity-accounts-compromised-by-very-targeted-attack-1409683803.

14. Apple, "Update to Celebrity Photo Investigation," Apple media advisory, 2 Sept. 2014; www.apple.com/pr/library/2014/09/02Apple-Media-Advisory.html.

15. Apple, "Frequently Asked Questions about Two-Step Verification for Apple ID," 2014; http://support.apple.com/kb/ht5570.

16. K.-K.R. Choo, "Legal Issues in the Cloud," IEEE Cloud Computing, vol. 1, no. 1, 2014, pp. 94–96.

KIM-KWANG RAYMOND CHOO is a senior lecturer in the School of Information Technology and Mathematical Science at the University of South Australia. His research interests include cyber and information security and digital forensics. He has published two books, six refereed monographs, nine refereed book chapters, and 101 refereed journal and conference articles. He is the recipient of various awards including a 2010 Australian Capital Territory Pearcey Award, 2009 Fulbright Scholarship, 2008 Australia Day Achievement Medallion, and the British Computer Society's Wilkes Award in 2007. Choo has a PhD in information security from Queensland University of Technology, Australia. Contact him at [email protected] or https://sites.google.com/site/raymondchooau.


SECURE BIG DATA IN THE CLOUD


Guest Editors' Introduction: Securing Big Data Applications in the Cloud

Bharat Bhargava, Purdue University
Ibrahim Khalil, RMIT University, Australia
Ravi Sandhu, University of Texas, San Antonio

Traditional security mechanisms are tailored for small-scale data, so they don't meet the needs of big data analytics and storage applications. This special issue aims to stimulate discussion and research toward the innovation of security and privacy mechanisms for big data applications in a cloud environment.


Cloud-based platforms are playing an increasingly important role in the context of big data analytics and storage applications. The velocity, volume, and variety of big data for large-scale cloud infrastructures can't be enhanced without security and privacy. Because traditional security mechanisms are tailored to securing small-scale data, they can't meet the needs of big data. Moreover, the inherent vulnerabilities of a cloud-based environment require significant focus on both privacy and security together with risk management procedures. To stimulate discussion and invigorate research interest toward the innovation of security and privacy mechanisms for big data applications in a cloud environment, this special issue discusses topics such as intrusion detection and attack prevention, risk awareness, secure and efficient data sharing, and access control.

The call for papers was well timed given the dynamic ongoing research on security and privacy for cloud-based big data applications. We received numerous submissions, and, after a rigorous peer review process, we selected five articles for this special issue.

The Articles

In "Enhancing Big Data Security with Collaborative Intrusion Detection," Zhiyuan Tan and his colleagues introduce a collaborative intrusion detection framework that focuses on efficiency, scalability, and self-adaption for big data applications in cloud computing. The system performs intrusion detection at both the host and network levels in a collaborative manner, using a model for parallel network summarization that utilizes cloud computing features.

The article "Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters," by Abdulrahman A. Almutairi and Arif Ghafoor, presents efficient risk-aware virtual resource management procedures that avoid information leakage in cloud-based multitenant sharing environments. The authors propose a sharing-based heuristic that reduces overall risk, and a partition-based heuristic that is scalable for large datacenters. They use sensitivity characterization to address the virtual resource assignment problem in environments with role-based access control (RBAC).

In "Efficient and Secure Transfer, Synchronization, and Sharing of Big Data," Kyle Chard, Steven Tuecke, and Ian Foster propose secure and efficient data access, transfer, and sharing functions for large datasets across multiple types of local and cloud storage, which they achieve through the Globus software-as-a-service (SaaS) platform for data transfer and synchronization. Their secure framework supports resiliency and integrity while spanning a variety of heterogeneous data storage systems.

A fourth article, "Location-Based Security Framework for Cloud Perimeters," by Chetan Jaiswal, Mahesh Nath, and Vijay Kumar, proposes a cost-effective model for location-based firewall filtering of attacks for mobile and static cloud environments. The authors introduce two schemes for identifying and filtering out static and mobile security attackers using a logic-based framework that's coupled with the dynamic revision of firewall policies. These functions are performed in a distributed manner, keeping the local and global policies in sync.

Finally, in "Multilabels-Based Scalable Access Control for Big Data Applications," Chen Hongsong, Bharat Bhargava, and Fu Zhongchuan propose a multilabel-based access control approach for Hadoop-based big data applications in clouds that is both efficient and scalable. The work combines active bundle, RBAC, discretionary access control (DAC), and mandatory access control (MAC), and includes a security degree, lifetime, and access policy among the multilabels. The authors evaluate the approach using a rigorous case study of a personal health record (PHR) data storage application. As both coauthor and guest editor, Bharat Bhargava did not take part in the peer review of this article.

We thank all of the authors who submitted manuscripts to this special issue. We also wish to thank the reviewers who helped to review the papers in a very short time period, as well as Editor in Chief Mazin Yousif for his encouragement and support in organizing this special issue. Finally, we thank the publication staff for their continuous support. We close this editorial by noting that several more feature topics on scalable and secure big data analytics are due to appear in the magazine in the near future.

BHARAT BHARGAVA is a professor of computer science at Purdue University. His research interests include security and privacy issues in distributed systems and sensor networks. This involves identity management, secure routing and dealing with malicious hosts, adaptability to attacks, and experimental studies. His recent work involves attack graphs for collaborative attacks. Bhargava has a PhD in computer science from Rutgers University. Contact him at [email protected].

IBRAHIM KHALIL is a senior lecturer in the School of Computer Science and IT, RMIT University, Melbourne, Australia. His research interests include data clustering, network security, scalable computing in distributed systems, m-health, e-health, wireless and body sensor networks, biomedical signal processing, and remote healthcare. Khalil has a PhD in computer science from the University of Berne, Switzerland. Contact him at [email protected].

RAVI SANDHU is the executive director of the Institute for Cyber Security at the University of Texas, San Antonio, where he holds the Lutcher Brown Endowed Chair in Cyber Security. His research interests include cybersecurity practice and education. Sandhu has a PhD in computer science from Rutgers University. He is an IEEE, ACM, and AAAS Fellow. Contact him at [email protected].


Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


SECURE BIG DATA IN THE CLOUD

Enhancing Big Data Security with Collaborative Intrusion Detection

Zhiyuan Tan, University of Twente

Upasana T. Nagar, Xiangjian He, and Priyadarsi Nanda, University of Technology Sydney

Ren Ping Liu, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

Song Wang, La Trobe University

Jiankun Hu, University of New South Wales

Cloud computing delivers a flexible network computing model that allows organizations to adjust their IT capabilities on the fly with minimal investment in IT infrastructure and maintenance. Because an organization need only pay for the services it uses, it can focus on its core business instead of handling technical issues.

In the cloud computing context, network-accessible resources are defined as services. These services are typically delivered via one of three cloud computing service models:

• Infrastructure as a service (IaaS) offers storage, computation, and network capabilities to service subscribers through virtual machines (VMs).

• Platform as a service (PaaS) provides an environment for software application development and hosts a client’s applications in a PaaS provider’s computing infrastructure.

• Software as a service (SaaS) delivers on-demand software services via a computer network, eliminating the cost of purchasing and maintaining software.

A collaborative intrusion detection system (CIDS) plays an important role in providing comprehensive security for data residing on cloud networks, from attack prevention to attack detection.



These technical and business advantages, however, don’t come without cost. The security vulnerabilities inherited from the underlying technologies (that is, virtualization, IP, APIs, and datacenter) prevent organizations from adopting the cloud in many critical business applications.1 Generally speaking, cloud computing is a service-oriented architecture (SOA). Earlier work gives a comprehensive dependability and security taxonomy framework revealing the complex security cause-implication relations in this architecture.2 We summarize cloud computing vulnerabilities by underlying technology in the sidebar.

These vulnerabilities leave loopholes, allowing cyberintruders to exploit cloud computing services and threatening the security and privacy of big data. Various security schemes, such as encryption, authentication, access control, firewalls, intrusion detection systems (IDSs), and data leak prevention systems (DLPSs), address these security issues. In this complex computing environment, however, no single scheme fits all cases. These schemes should thus be integrated and cooperate to provide a comprehensive line of defense.

Intrusion Detection for Securing Cloud Computing

IDSs aim to provide a layer of defense against malicious uses of computing systems by sensing attacks and alerting users. Because it’s impossible to prevent all cyberattacks, IDSs have become essential to securing cloud computing environments.

IDSs are commonly categorized by the type of data source involved in detection. Host-based IDSs (HIDSs) detect malicious events on host machines. They handle insider attacks (which attempt to gain unauthorized privileges) and user-to-root attacks (which attempt to gain root privileges to VMs or the host). Network-based IDSs (NIDSs) monitor and flag traffic carrying malicious contents or presenting malicious patterns. This type of IDS can detect direct and indirect flooding attacks, port-scanning attacks, and so on.

Although to some extent, DLPSs can be considered a type of IDS, they’re more tailored to data security. However, it’s difficult to completely guarantee data security using DLPSs alone. Attackers who gain control of the host machines can modify the DLPS settings, thereby completely disclosing data to those attackers. Moreover, even though firewalls can block unwanted network traffic packets according to a predefined rule set, they can’t detect sophisticated intrusive attempts such as flooding and insider attacks. IDSs, DLPSs, and firewalls are therefore not interchangeable security schemes but collaborative ones.

Conventional IDSs

Conventional IDSs are mostly standalone systems residing on computer networks or host machines. They can be categorized as misuse-based or anomaly-based IDSs, depending on the detection mechanism applied.

Misuse-based IDSs enjoy high detection accuracy but are vulnerable to all zero-day intrusions.3 This is due to the underlying detection mechanism that checks for a match with existing attack signatures. Obviously, an IDS can’t generate signatures for an unknown attack. Anomaly-based IDSs show promise for detecting zero-day intrusions,4,5 but are prone to high false positives.
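The distinction can be sketched in a few lines of Python. This is an illustrative toy, not a production IDS; the signature set, the traffic feature, and the threshold values are hypothetical.

```python
# Toy contrast of the two detection mechanisms described above. SIGNATURES
# and the 3-sigma anomaly rule are hypothetical choices for illustration only.

SIGNATURES = {"GET /admin.php?cmd=", "\x90\x90\x90\x90"}  # known attack patterns

def misuse_detect(payload: str) -> bool:
    """Misuse-based: flag only payloads matching an existing signature."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_detect(rate: float, mean: float, std: float, k: float = 3.0) -> bool:
    """Anomaly-based: flag traffic whose rate deviates k std-devs from normal."""
    return abs(rate - mean) > k * std

zero_day = "GET /new-exploit"                 # no signature exists for it yet
print(misuse_detect(zero_day))                # False: zero-day evades misuse detection
print(anomaly_detect(9000.0, 1000.0, 500.0))  # True: rate is far outside the profile
```

The sketch shows why the two mechanisms complement each other: the first is precise but blind to anything unsigned, while the second catches deviations at the cost of false positives near the threshold.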

Current enterprise networks (such as cloud computing environments) typically have multiple entry points. This topology is intended to enhance a network’s accessibility and availability, but it leaves security vulnerabilities that sophisticated attackers can exploit using advanced techniques, such as cooperative intrusions.

Unlike traditional attack mechanisms, cooperative attack mechanisms are launched simultaneously by slave machines within a botnet. Attackers organize instances of this attack type to penetrate an enterprise network through all its entry points. By evenly distributing the attack traffic volume to the different entry points, these cooperative intrusions can evade detection by traditional standalone IDSs set in front of the entry points. This is because network traffic behavior at each entry point doesn’t significantly deviate from normal behavior. After traveling through the entry points, the attack instances are directed to a single targeted service within the enterprise network.

Moreover, many of the existing intrusions can occur collaboratively and simultaneously on nodes throughout a network. Attackers can initiate automated attacks targeting all vulnerable services within a network simultaneously,6 rather than focusing on a specific service.
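A small worked example illustrates why evenly split attack traffic evades per-entry-point thresholds while a network-wide view exposes it. All rates and thresholds below are hypothetical.

```python
# Why cooperative attacks evade standalone IDSs: the botnet spreads its
# traffic so no single entry point deviates much from normal. All numbers
# here are hypothetical, chosen only to make the effect visible.

NORMAL_RATE = 100.0   # packets/s of legitimate traffic at each entry point
THRESHOLD = 150.0     # a standalone IDS at an entry point alarms above this
ENTRY_POINTS = 8
ATTACK_TOTAL = 320.0  # total attack packets/s, split evenly by the botnet

per_entry_rate = NORMAL_RATE + ATTACK_TOTAL / ENTRY_POINTS  # 140.0 packets/s

# Each standalone IDS sees traffic below its threshold, so none alarms.
local_alarms = [per_entry_rate > THRESHOLD for _ in range(ENTRY_POINTS)]

# A detector with a network-wide view sees the aggregate deviation clearly.
GLOBAL_THRESHOLD = 1000.0                    # hypothetical network-wide limit
global_rate = per_entry_rate * ENTRY_POINTS  # 1120.0 packets/s

print(any(local_alarms))               # False: the attack slips past every entry point
print(global_rate > GLOBAL_THRESHOLD)  # True: aggregation exposes it
```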


Need for Collaborative Intrusion Detection

Conventional standalone IDSs are susceptible to cooperative attacks, so they’re unsuitable for collaborative environments (such as a cloud computing environment). To defend against this type of attack, collaborative intrusion detection systems (CIDSs) correlate suspicious evidence between different IDSs to improve intrusion detection efficiency. Unlike conventional standalone IDSs, a CIDS shares traffic information with the IDSs located at a local network’s entry points.

In practice, we can organize IDSs within a CIDS in a decentralized7 or hierarchical8 manner over a large network. These IDSs communicate directly with each other or with a central coordinator, according to the applied mode of organization.

In a decentralized CIDS, each IDS can generate a complete attack diagram of the network by

VULNERABILITIES IN UNDERLYING TECHNOLOGIES

Vulnerabilities in the cloud’s underlying technologies allow cyberintruders to exploit cloud computing services and threaten the security and privacy of big data.

Virtualization
Virtualization facilitates multitenancy and resource sharing (such as physical machines and networks) and enables maximum utilization of available resources. Categories include full, OS-layer, and hardware-layer virtualizations. Virtual machines (VMs) can gain full access to a host’s resources if isolation between the host and the VMs isn’t properly configured and maintained. (In this case, the VMs escape to the host and seize root privileges.) In addition, a VM’s security can’t be guaranteed if its host is compromised. Hosts and their VMs share networks via a virtual switch, which VMs could use as a channel to capture the packets transiting over the networks or to launch Address Resolution Protocol (ARP) poisoning attacks. Finally, because a host shares computing resources with its VMs, a guest could launch a denial-of-service (DoS) attack via a VM by taking up all the host’s resources.

IP Suite
The IP suite, the core component of the Internet, ensures the functioning of internetworking systems and allows access to remote computing resources. Defects in the implementation of the TCP/IP protocol suite can lead to a variety of attacks, including IP spoofing, ARP spoofing, DNS poisoning, Routing Information Protocol (RIP) attacks, flooding, HTTP session riding, and session hijacking.

Application Programming Interfaces
APIs provide interfaces for managing cloud services, including service provisioning, orchestration, and monitoring. Areas of vulnerability include weak credentials, authorization checks, and input-data validation, which could allow an attacker to seize root privileges. Developers might introduce defects during the design and implementation of cloud APIs or introduce new security vulnerabilities when fixing bugs.

Datacenter
Datacenter technologies allow administrators to manage and store data. Data is often stored, processed, and transferred in plaintext, which can be compromised, leading to the loss of confidentiality. Attackers might also find residual data from data that’s been deleted. Finally, in a datacenter, data from different users (both legitimate users and intruders) is mixed together with weak separation, providing opportunities for an intruder to access the data of the legitimate users.


aggregating network information received from other IDSs in the CIDS. Detection of malicious attempts is undertaken locally at each IDS. In a hierarchical CIDS, a coordinator is a central point responsible for information aggregation. The central coordinator, which analyzes the aggregated information, generates a complete attack diagram of the network.
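The hierarchical mode can be sketched as agents reporting local traffic summaries to a central coordinator that aggregates them into a network-wide view. The class and field names below are illustrative, not taken from any specific implementation.

```python
# Hedged sketch of a hierarchical CIDS: each agent sees only its own entry
# point, while the coordinator's aggregate view reveals traffic concentration.

from collections import Counter

class Agent:
    def __init__(self, name):
        self.name = name
        self.flows = Counter()   # destination -> packets seen at this entry point

    def observe(self, dst, pkts):
        self.flows[dst] += pkts

    def summarize(self):
        return dict(self.flows)  # the local summary sent up the hierarchy

class Coordinator:
    def __init__(self):
        self.global_view = Counter()  # network-wide aggregation of all summaries

    def aggregate(self, summaries):
        for s in summaries:
            self.global_view.update(s)
        return self.global_view.most_common(1)

# Three entry-point agents each see only a modest flow to the same target...
agents = [Agent(f"entry-{i}") for i in range(3)]
for a in agents:
    a.observe("victim-service", 50)

# ...but the coordinator's aggregated view reveals the concentration.
coord = Coordinator()
print(coord.aggregate(a.summarize() for a in agents))  # [('victim-service', 150)]
```

In a decentralized CIDS, the same `aggregate` logic would instead run at every IDS over summaries exchanged with its peers.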

Limitations of Current Collaborative IDSs

Collaborative IDSs seem promising for detecting cooperative intrusions. However, existing system architectures aren’t without criticism. In CIDSs, network data summarization is an important precursor to reliable intrusion detection.9 However, traditionally, network information is collected and processed by IDS software built on a single network device that only deals with the traffic flowing in and out of that device. It therefore has limited traffic information. In addition, the computation of network data summarization is proportional to the amount of traffic flow that single device experiences. Such an approach has drawbacks in terms of both accuracy and efficiency.

In terms of accuracy, without knowledge of the network data from other nodes, any summarization is specific to a partial and insignificant portion of all available data over the entire network. Exchanging and combining these summarizations later, without the actual data, provides a minimal information gain.

In terms of efficiency, nodes with denser traffic require additional computation to process summarization. Because summarization is a pure overhead operation, in an ideal environment, a node will have less traffic to process when performing summarization tasks.

Security is another concern for existing CIDSs. If a CIDS is compromised, the entire cloud computing environment is in danger. Conventional IDS software, installed on a single network device, analyzes and maintains network information on the device but doesn’t include security properties that ensure confidentiality, authentication, and integrity. Thus, CIDSs that are designed simply by integrating conventional IDS software without proper security enhancements are vulnerable to attacks.

Collaborative Intrusion Detection Framework

Given the defects of existing CIDSs, a new sophisticated CIDS framework could strengthen the security of cloud computing systems. However, cloud computing presents unique issues. With a large, dense network of nodes forming a cloud environment, cloud computing offers us unprecedented opportunities for making available network data from all nodes. At the same time, it requires that we perform summarization and combine the results in a distributed and parallel manner. In addition, because we’re now dealing with all the network data in the entire cloud, where an unknown number of categories can exist, the summarization algorithms will need to expand their categories on demand to automatically create new clusters when they discover new types of traffic emerging.

Given the characteristics of cloud computing, we must consider several desirable properties when designing a new CIDS framework. These properties include fast detection of various attacks with minimal false positive rates, scalability with the expansion of the cloud computing system, self-adaption to changes in the cloud computing environment, and resistance to compromise.10 Figure 1 shows the framework of our proposed CIDS, which meets these requirements.

As Figure 1 shows, HIDSs and NIDSs cooperate to perform intrusion detection at the host and network levels, and each IDS in the network is equipped with signature- and anomaly-based detectors.11 This tactic ensures better detection accuracy in both known and unknown attacks.

There are two categories of nodes in this framework—cooperative agent and central coordinator. These nodes form a collaborative system whose security is assured through the implementation of various security mechanisms.

Cooperative Agents

Cooperative agents stand at the front lines and detect misuses on host machines or malicious behavior on networks. These agents are equipped with HIDSs or NIDSs depending on their location—agents installed on a host machine to detect suspicious events are equipped with HIDSs, whereas agents monitoring traffic on a network are equipped with NIDSs.

In our framework, the cooperative agents located on host machines are a new type of HIDS, requiring no instrumentation within VMs and modeling processes at the VM granularity level (that is, treating VMs as individual processes and modeling VM behaviors accordingly). This scheme ensures that our detection system complies with service-level agreements (SLAs) and legal restrictions, which might not allow an IaaS provider to make amendments or perform intensive monitoring and surveillance on client VMs. It also alleviates the ineffectiveness of NIDSs on encrypted traffic. The host-based cooperative agents inform a central coordinator when they detect an intrusive behavior or activity.

Cooperative agents residing at the network level conduct first-tier detection, defending against


FIGURE 1. Framework of a collaborative intrusion detection system (CIDS). The figure illustrates how the different types of fellow IDSs are deployed in a cloud computing environment, and how they cooperate with each other and central coordinators in detecting intrusions. (HIDS: host-based IDS; NIDS: network-based IDS)

generic attacks that present abnormality within the network traffic and don’t involve sophisticated cooperation. The network-based cooperative agents alert a central coordinator to any suspicious packets detected. Meanwhile, these agents summarize network traffic flowing through the network in a distributed and parallel manner. In network data summarization, nonparametric Bayesian methods could be a suitable machine learning approach for solving the challenges of cloud computing.12 Network summarization is particularly important for detecting cooperative intrusions, such as distributed denial-of-service (DDoS) attacks. These summarizations are periodically sent to a central coordinator, as we discuss next.

This parallel summarization is empowered by cloud computing through the MapReduce framework.13 The MapReduce framework provides seamless and effortless integration of our CIDS framework into a distributed and parallel architecture by treating the network-based cooperative agents as slave nodes and the central coordinator as a master node. The MapReduce framework manages all details, ranging from scheduling to information aggregation.
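This division of labor can be mimicked with plain Python in place of a real Hadoop cluster: the "map" step is the per-agent traffic summarization and the "reduce" step is the coordinator's merge of partial summaries. The flow records and field names are illustrative.

```python
# MapReduce-style sketch of the parallel summarization described above,
# mimicked locally rather than on an actual Hadoop deployment.

from functools import reduce
from collections import Counter

def map_summarize(local_packets):
    """'Map' step, run independently by each network-based agent."""
    return Counter((p["src"], p["dst"]) for p in local_packets)

def reduce_merge(summary_a, summary_b):
    """'Reduce' step, run by the central coordinator to merge partial summaries."""
    return summary_a + summary_b

# Traffic observed by two agents (slave nodes); in a real deployment the
# map step runs in parallel across the cluster, scheduled by the framework.
agent_traffic = [
    [{"src": "10.0.0.1", "dst": "svc"}, {"src": "10.0.0.2", "dst": "svc"}],
    [{"src": "10.0.1.9", "dst": "svc"}],
]
partial_summaries = [map_summarize(t) for t in agent_traffic]
global_summary = reduce(reduce_merge, partial_summaries, Counter())
print(sum(global_summary.values()))  # 3: all flows visible in one aggregated view
```

Because the merge is associative, partial summaries can be combined in any order or grouping, which is what lets the framework distribute the reduce work freely.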

Central Coordinator

Finally, the network traffic aggregation is performed on the central coordinator, which generates a complete attack diagram of the entire network (that is, the cloud computing system). Based on this aggregation, the central coordinator is capable of capturing sophisticated cooperative intrusions that the individual network-based cooperative agents miss. When intrusive behaviors (including those identified by the cooperative agents and the central coordinator) are detected, the central coordinator raises an alert to a system administrator.

It’s worth noting that a hybrid detector combining misuse-based and anomaly-based detection mechanisms can help reduce detection time and enhance the detection accuracy for both known and unknown attacks.


Security Mechanisms

To ensure that the CIDS is resistant to compromise, we use authentication and encryption as well as an integrity check. Because the CIDS works 24/7, energy-efficient group key distribution schemes are preferable for secure key distribution and node authentication.14,15 These schemes provide a strong, secure mechanism for updating group keys when nodes join or leave the network or when a node is compromised. They’re also resilient to collusion attacks, in which multiple nodes are compromised and coordinated for attack. Finally, a backup central coordinator runs alongside the main coordinator to prevent a single point of failure. The coordinators’ roles can be exchanged depending on actual requirements and network conditions.
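The backup-coordinator safeguard can be sketched as a simple heartbeat-and-promote protocol. This is an illustrative mechanism, not a prescribed design; the timeout value is arbitrary.

```python
# Minimal sketch of failover between the main and backup coordinators: the
# backup watches the primary's heartbeats and takes over the coordinator
# role when they stop. Timing values are illustrative only.

import time

class Coordinator:
    def __init__(self, role):
        self.role = role                      # "primary" or "backup"
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check_failover(self, peer, timeout=5.0):
        """Backup promotes itself if the primary has been silent too long."""
        if self.role == "backup" and time.monotonic() - peer.last_heartbeat > timeout:
            self.role, peer.role = "primary", "backup"

primary, backup = Coordinator("primary"), Coordinator("backup")
primary.last_heartbeat -= 10.0   # simulate a primary that stopped responding
backup.check_failover(primary)
print(backup.role)  # "primary": roles have been exchanged
```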

Future studies will explore the framework’s implementation and application on different cloud computing systems. Our future work will focus on algorithms for distributed and parallel data summarization on cloud computing and their implementation on the MapReduce framework, as well as new detection approaches for HIDSs.

Acknowledgments

The work described here was performed when Zhiyuan Tan was a research associate with the School of Computing and Communications at the University of Technology, Sydney.

References

1. C. Modi et al., “A Survey on Security Issues and Solutions at Different Layers of Cloud Computing,” J. Supercomputing, vol. 63, no. 2, 2013, pp. 561–592.

2. J. Hu et al., “Seamless Integration of Dependability and Security Concepts in SOA: A Feedback Control System Based Framework and Taxonomy,” J. Network and Computer Applications, vol. 34, no. 4, 2011, pp. 1150–1159.

3. Y. Meng, W. Li, and L.-F. Kwok, “Towards Adaptive Character Frequency-Based Exclusive Signature Matching Scheme and Its Applications in Distributed Intrusion Detection,” Computer Networks, vol. 57, no. 17, 2013, pp. 3630–3640.

4. G. Creech and J. Hu, “A Semantic Approach to Host-Based Intrusion Detection Systems Using Contiguous and Discontiguous System Call Patterns,” IEEE Trans. Computers, vol. 63, no. 4, 2014, pp. 807–819.

5. Z. Tan et al., “A System for Denial-of-Service Attack Detection Based on Multivariate Correlation Analysis,” IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 2, 2014, pp. 447–456.

6. S. Savage, “Internet Outbreaks: Epidemiology and Defenses,” keynote address, Internet Soc. Symp. Network and Distributed System Security (NDSS 05), 2005; http://cseweb.ucsd.edu/~savage/papers/InternetOutbreak.NDSS05.pdf.

7. S. Ram, “Secure Cloud Computing Based on Mutual Intrusion Detection System,” Int’l J. Computer Application, vol. 2, no. 1, 2012, pp. 57–67.

8. S.N. Dhage and B. Meshram, “Intrusion Detection System in Cloud Computing Environment,” Int’l J. Cloud Computing, vol. 1, no. 2, 2012, pp. 261–282.

9. D. Hoplaros, Z. Tari, and I. Khalil, “Data Summarization for Network Traffic Monitoring,” J. Network and Computer Applications, vol. 37, Jan. 2014, pp. 194–205.

10. A. Patel et al., “An Intrusion Detection and Prevention System in Cloud Computing: A Systematic Review,” J. Network and Computer Applications, vol. 36, no. 1, 2013, pp. 25–41.

11. A.K. Jones and R.S. Sielken, Computer System Intrusion Detection: A Survey, tech. report, Dept. of Computer Science, Univ. of Virginia, 2000; http://atlas.cs.virginia.edu/~jones/IDS-research/Documents/jones-sielken-survey-v11.pdf.

12. N.L. Hjort et al., eds., Bayesian Nonparametrics, vol. 28, Cambridge Univ. Press, 2010.

13. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.

14. B. Tian et al., “A Mutual-Healing Key Distribution Scheme in Wireless Sensor Networks,” J. Network and Computer Applications, vol. 34, no. 1, 2011, pp. 80–88.

15. B. Tian et al., “Self-Healing Key Distribution Schemes for Wireless Networks: A Survey,” Computer J., vol. 54, no. 4, 2011, pp. 549–569.

ZHIYUAN TAN is a postdoctoral research fellow in the Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, Netherlands. His research interests include network security, pattern recognition, machine learning, and distributed systems. Tan received a PhD from the University of Technology Sydney (UTS), Australia. He’s an IEEE member. Contact him at [email protected].

UPASANA T. NAGAR is a PhD student in the School of Computing and Communications at the University of Technology, Sydney (UTS), Australia,


and a student member of the Research Centre for Innovation in IT Services and Applications (iNEXT) at UTS. Her research interests include network security, pattern recognition, and cloud computing. Nagar received a bachelor’s degree in electronics from the National Institute of Technology, Surat. Contact her at [email protected].

XIANGJIAN HE is a professor of computer science in the School of Computing and Communications at the University of Technology, Sydney (UTS). He’s also director of the Computer Vision and Recognition Laboratory, leader of the Network Security Research group, and a deputy director of the Research Centre for Innovation in IT Services and Applications (iNEXT) at UTS. His research interests include network security, image processing, pattern recognition, and computer vision. He received a PhD in computer science from the University of Technology Sydney (UTS), Australia. He’s an IEEE senior member. Contact him at [email protected].

PRIYADARSI NANDA is a senior lecturer in the School of Computing and Communications at the University of Technology, Sydney (UTS), Australia. He’s also a core research member at the Centre for Innovation in IT Services and Applications (iNEXT) at UTS. His research interests include network security, network QoS, sensor networks, and wireless networks. Nanda received a PhD in computer science from the University of Technology Sydney (UTS), Australia. He’s an IEEE senior member. Contact him at [email protected].

REN PING LIU is a principal scientist of networking technology at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and an adjunct professor at Macquarie University and the University of Technology, Sydney (UTS), Australia. His research interests include MAC protocol design, Markov analysis, quality-of-service scheduling, TCP/IP internetworking, and network security. Liu received a PhD in electrical and computer engineering from the University of Newcastle, Australia. He’s an IEEE senior member. Contact him at [email protected].

SONG WANG is a senior lecturer with the Department of Electronic Engineering, La Trobe University, Melbourne, Australia. Her research interests include biometric security, blind system identification, and wireless communication. Wang received a PhD in electrical and electronic engineering from the University of Melbourne. Contact her at [email protected].

JIANKUN HU is a full professor and research director of the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, Australia. His research interests are in the field of cybersecurity, including biometrics security. Hu received a PhD in control engineering from Harbin Institute of Technology, China. He’s an IEEE member. Contact him at [email protected].



Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters

Abdulrahman A. Almutairi and Arif Ghafoor, Purdue University

The cloud computing platform-as-a-service (PaaS) paradigm allows application developers to deploy big data applications in the cloud. These applications can be found in the areas of healthcare, e-government, science, and business.1 PaaS cloud providers can host customer data stores on premise and outsource the computation to virtual resources from multiple infrastructure-as-a-service (IaaS) cloud providers. These virtual resources can be hosted by multitenant public cloud providers such as the Amazon Elastic Compute Cloud (EC2). The sheer size of big data poses serious security challenges for these applications. The backend data store can use an access control mechanism to isolate and enforce controlled data sharing.2 However, when the data is transferred from the backend data store to application logic, it can be leaked through virtual resource vulnerabilities. In a multitenant environment, untrusted tenants can exploit these vulnerabilities, increasing the data leakage risk.

Efficient risk-aware virtual resource assignment mechanisms for the cloud’s multitenant environment can help to minimize the risk of information leakage due to cloud virtual resource vulnerability.

This article focuses on virtual resource vulnerabilities that can cause data leakage, resulting in side-channel attacks and virtual machine (VM) escape.3,4 Proposed solutions to this problem—such as trusted virtual domain,5 secure hypervisor,6 and Chinese wall policies7—offer secure virtual resource isolation among tenants. However, achieving this isolation lowers resource utilization.

34 IEEE CLOUD COMPUTING | PUBLISHED BY THE IEEE COMPUTER SOCIETY | 2325-6095/14/$31.00 © 2014 IEEE

We propose intelligent virtual resource allocation techniques that assign resources to a cloud customer's data-centric applications. These techniques have low complexity and minimize the imposed risk. We assume a role-based access control (RBAC)8 mechanism for multitenant datacenter protection. However, the approach is generic and can be applied to any security policy, including discretionary or mandatory access control.

Virtual Resource Vulnerability

Elsewhere, we proposed a distributed access control architecture featuring a virtual resource manager (VRM).9 The VRM allocates virtual resources to cloud customers based on an access control policy enforced by an access control module (ACM), as Figure 1a shows. In general, these resources are allocated to satisfy some service-level agreement (SLA) requirements for each cloud customer and to minimize the cost of provisioning for PaaS cloud providers. As Figure 1b shows, the VRM includes workload estimation, resource vulnerability estimation, and resource assignment components. The workload estimation component estimates the sharing of data among different roles of the RBAC policy. The VRM's resource vulnerability estimation component uses security analysis tools to estimate virtual resource vulnerability.10 Subsequently, this component can be used to characterize virtual resources' vulnerability to different security measurements—for example, highly secured or unsecured virtual resources. The resource assignment component uses the workload and vulnerability estimations to assign the virtual resources to cloud customers' applications with the goal of minimizing the total risk of data leakage.

The risk of data leakage depends on the access control policy and virtual resource vulnerabilities. ISO 27005 defines this risk as “the potential that a given threat will exploit vulnerability of an asset or group of assets and thereby cause harm to the organization.”11 Using this definition, we formulate the risk due to data leakage for an application in a datacenter as:

Risk = Assets × Vulnerability × Threat (1)

Here, we assume that a role's (the cloud customer's) assets are the number of data objects (such as tuples or files) it has access to and that are stored in PaaS. Vulnerability is the probability of data leakage as a result of virtual resource vulnerabilities. To capture the worst-case scenario for risk assessment, we assume that the threat is equal to 1 for all roles (cloud customers). In other words, because of resource vulnerabilities, each cloud customer poses a threat in terms of accessing other customers' data objects, and vice versa.
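As a minimal sketch, Equation 1 under the worst-case threat assumption reduces to a one-line computation; the function name and the sample numbers below are illustrative, not from the article:

```python
def leakage_risk(assets, vulnerability, threat=1.0):
    """Risk = Assets x Vulnerability x Threat (Equation 1).

    assets: number of data objects a role has access to.
    vulnerability: probability of data leakage (0..1) due to
        virtual resource vulnerabilities.
    threat: set to 1 by default, matching the worst-case
        assumption made for risk assessment above.
    """
    return assets * vulnerability * threat

# A role with access to 1,000 objects on a resource whose leakage
# probability is 0.2 carries a worst-case risk of 200.
print(leakage_risk(1000, 0.2))  # 200.0
```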

We propose a workload approximation model based on a given RBAC policy and characterization of a cloud datacenter. Using this model, we present a risk-aware assignment problem as well as assignment heuristics for virtual resource allocation. Because of page limitations, we present our proofs elsewhere.12

RBAC Policy Model for Access Control Modules

A datacenter RBAC policy defines permissions for roles to access data objects.8 We formally define this assignment as follows.

Definition 1: Given an RBAC policy P for a big datacenter where R is the set of roles and O is the set of data objects, we can represent the permission-to-role assignment PA as a directed bipartite graph G(V, E), where V = R ∪ O such that R ∩ O = ∅. The edges eij ∈ E in G represent the existence of the role-to-permission assignment (ri, oj) ∈ PA in the RBAC policy P, where ri ∈ R and oj ∈ O.

A role vertex's out-degree represents the role's cardinality, and a data object vertex's in-degree represents the degree of sharing of that object among roles. Figure 2a represents an RBAC policy with |R| = 4 and |O| = 20 as a bipartite graph model. As the figure shows, the cardinality of role r1 is out-degree(r1) = 11. Also, the degree of sharing of data object o20 is in-degree(o20) = 4.
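These bipartite-graph quantities can be sketched in a few lines. The three-role policy PA below is a hypothetical toy example, not the article's 4-role, 20-object policy:

```python
# Hypothetical permission-to-role assignment PA: each role maps to
# the data objects it may access (the edges of the bipartite graph G).
PA = {
    "r1": {"o1", "o2", "o3"},
    "r2": {"o3", "o4"},
    "r3": {"o3"},
}

def role_cardinality(pa, role):
    # Out-degree of a role vertex = the role's cardinality.
    return len(pa[role])

def degree_of_sharing(pa, obj):
    # In-degree of a data-object vertex = how many roles access it.
    return sum(1 for objs in pa.values() if obj in objs)

print(role_cardinality(PA, "r1"))   # 3
print(degree_of_sharing(PA, "o3"))  # 3 (shared by r1, r2, and r3)
```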

The VRM's resource assignment component requires the cardinality of shared data objects among roles. For big datacenters, computing these cardinalities from the bipartite graph is a daunting task. We propose an alternative representation of RBAC by clustering all data objects that are accessed by the different roles into a set of nonoverlapping partitions. The set W, consisting of the cardinalities of these partitions, is the spectral model for the RBAC policy. We define this model as follows.



Definition 2 (RBAC spectral model): Given a bipartite graph representation of RBAC policy G(V, E), let P(R) be the power set of R excluding the null set ∅. The spectral representation of RBAC is the set W, with its elements indexed by P(R) and lexicographically ordered. Formally, let p ∈ P(R). Then, we define wp ∈ W as:

$$w_p = \bigl|\{\, o_k \in O \;:\; e_{r_i o_k} \in E \ \ \forall r_i \in p,\ \ e_{r_j o_k} \notin E \ \ \forall r_j \in R \setminus p \,\}\bigr|$$

Note that |W| = 2^n − 1. This model has two advantages over the bipartite graph model. First, we can use it to characterize an RBAC policy in terms of a datacenter's sensitivity using a single parameter. Based on the degree of sharing among roles, the datacenter sensitivity can be high, medium, or low, as elaborated later. In addition, the spectral model allows resource assignment based on a given percentage of the datacenter. Varying this percentage can lead to variable complexity of an assignment algorithm.

FIGURE 1. The virtual resource management (VRM) architecture: (a) virtual resource design and (b) virtual resource management mechanism. [Diagram: users' resource requests flow to the VRM and ACM, backed by a policy base and a big data store in PaaS; the VRM's workload estimation, resource vulnerability estimation, and resource assignment components place VM1, ..., VMK across the virtualization and physical layers of IaaS1, ..., IaaSm.]

FIGURE 2. RBAC policy representation: (a) example of RBAC permission assignment and (b) spectral lattice representation of RBAC. [Lattice values: w{1} = 3, w{2} = 6, w{3} = 2, w{4} = 0; w{1,2} = 0, w{1,3} = 0, w{1,4} = 2, w{2,3} = 0, w{2,4} = 0, w{3,4} = 1; w{1,2,3} = 2, w{1,2,4} = 2, w{1,3,4} = 0, w{2,3,4} = 0; w{1,2,3,4} = 2.]

The set W can be generated from the bipartite graph model of RBAC. The members wp ∈ W are nonoverlapping and can be viewed as vertices of a lattice (that is, a binary n-cube) with n levels, where n is the number of roles. For example, nodes at level 1 of the lattice represent the cardinalities of partitions corresponding to unshared data objects belonging to individual roles. The nodes at level 2 represent the cardinalities of data partitions that are accessed by two roles. Similarly, the nodes at level n contain data objects shared by all roles. The nodes of W can be indexed using the role IDs associated with the partition, as Figure 2b shows. As mentioned above, the indices of W are subsets of P(R) and its elements are the cardinalities of partitions that can be accessed by all the roles in these subsets. Note that the total size of the datacenter is given as:

$$\sum_{\forall w_p \in W} w_p$$

The following example illustrates the spectral model.

Example 1: Figure 2a shows an access control policy with |R| = 4 and |O| = 20 as a bipartite graph. The spectral model is shown as a lattice in Figure 2b. Notice that w{1,4} = |{o9, o10}| = 2 because o9 and o10 are accessed by both the roles r1 and r4.
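A sketch of how W might be computed from a permission assignment. The small policy below is hypothetical (the article's full 20-object edge list isn't reproduced here), but the construction follows Definition 2: each object falls into exactly one partition, indexed by the set of roles that access it.

```python
from collections import Counter

# Hypothetical policy: roles -> accessible data objects.
PA = {
    "r1": {"o1", "o2", "o5"},
    "r2": {"o3", "o5"},
    "r3": {"o4", "o5"},
}

def spectral_model(pa):
    """Return W: partition index (frozenset of roles) -> cardinality w_p."""
    objects = set().union(*pa.values())
    w = Counter()
    for obj in objects:
        # The partition index p is exactly the set of roles accessing obj,
        # so the resulting partitions are nonoverlapping by construction.
        p = frozenset(r for r, objs in pa.items() if obj in objs)
        w[p] += 1
    return dict(w)

W = spectral_model(PA)
print(W[frozenset({"r1"})])              # 2 (o1 and o2 are unshared)
print(W[frozenset({"r1", "r2", "r3"})])  # 1 (o5 is shared by all roles)
# The total size of the datacenter equals the sum over all w_p.
print(sum(W.values()))                   # 5
```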

Datacenter Workload Estimation for RBAC Policy

For a big datacenter, specifying the exact value of wp is a challenge. One practical approach is to use cardinality estimation techniques.13 For example, for a transactional workload, we can use the selectivity estimation of query processing for a big datacenter to estimate a given query's size,13 whereby that query (or a collection of queries) can correspond to a role. If we use role mining to design an RBAC policy, we can use role mining techniques such as multi-assignment clustering14 to estimate the cardinalities of the set W. Here, we assume the access of data objects in a datacenter follows a Zipfian distribution, an assumption supported by the Yahoo Cloud Serving Benchmark (YCSB).15 Because in this distribution some objects are shared by a large number of roles (queries) while most are shared among a smaller number of roles (queries), it can provide a heterogeneous workload for RBAC. The Zipfian distribution is given as follows:

$$f(\alpha; s, N) = \frac{1/\alpha^{s}}{\sum_{i=1}^{N} 1/i^{s}} \qquad (2)$$

where N is the maximum rank, α is the selected rank, and s is the parameter to control the distribution shape.

According to this distribution, if parameter s = 1, then the probability that a data object is assigned to a single role (which corresponds to setting the rank α = 1) doubles the probability of assigning that data object to two roles, the case for which rank α = 2. As the value of s increases, the number of data objects assigned only to individual roles becomes larger, as Figure 3 shows. Note that the value of s should be greater than or equal to 1.
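Equation 2 can be sketched directly; zipf_pmf is an illustrative name, and the check confirms the doubling property for s = 1 described above:

```python
def zipf_pmf(alpha, s, N):
    """Equation 2: probability of rank alpha under a Zipfian
    distribution with shape parameter s over ranks 1..N."""
    norm = sum(1.0 / i**s for i in range(1, N + 1))
    return (1.0 / alpha**s) / norm

# With s = 1, rank 1 (an object assigned to a single role) is exactly
# twice as likely as rank 2 (an object shared by two roles).
p1 = zipf_pmf(1, 1.0, 30)
p2 = zipf_pmf(2, 1.0, 30)
print(round(p1 / p2, 6))  # 2.0
```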

As Algorithm 1 (Figure 4) proposes, we can use the Zipfian distribution to generate a heterogeneous RBAC-based workload in two steps. In the first step, we classify data objects into n buckets, where each bucket represents the number of total data objects assigned to a lattice level in Figure 2b. For example, data objects in bucket 1 are exclusively accessed by one role, whereas all roles share data objects in bucket n. The number of data objects in each bucket follows the Zipfian distribution. In the second step, we assign data objects from bucket i to randomly selected partitions at level i of the lattice. Note, the number of partitions at level i is $\binom{n}{i}$.

Characterizing Datacenter Sensitivity Using a Spectral Model

Based on the statistical property of the access control workload, we propose a data sensitivity-based classification of cloud datacenters. The sensitivity classification depends on the level of data object sharing among roles. In particular, we define a datacenter's sensitivity as the average degree of sharing among its data objects. For example, in Figure 2, the datacenter's average degree of sharing is (11 × 1 + 3 × 2 + 4 × 3 + 2 × 4)/20 = 1.85. If the degree of sharing on average is low, we say the datacenter has high sensitivity. On the contrary, if there is extensive sharing of data objects among roles, we say this datacenter has low sensitivity. The medium sensitivity class falls in the middle.

We can also model the data object sharing and datacenter classification using the Zipfian distribution. The key parameter to characterize datacenter sensitivity is the scalar parameter s of the Zipfian density function shown in Equation 2. As Figure 3 shows, the smaller the value of s, the more data objects are uniformly distributed in the set W of the RBAC spectral model. In the following example, we illustrate how we can use Zipfian parameter s to classify datacenter sensitivity.

Example 2: For a datacenter with 0.5 × 10^6 data objects, suppose we have three RBAC policies (P1, P2, P3), each with n = 30 roles. Figure 3 shows a histogram of objects across the spectral lattice. Depending on the Zipfian distribution, we can identify three classes of datacenters—high sensitive (HSD), medium sensitive (MSD), and low sensitive (LSD)—with respect to policies P1, P2, and P3. For example, HSD has a large value of s (s ≥ 2) because the sharing of data objects among P1 roles is very small. On the other hand, LSD has a small value of s (1.5 > s ≥ 1), depicting extensive sharing of data objects among roles of policy P3. The MSD has a value of s that falls in the middle (2 > s ≥ 1.5). Note that the number of data objects at level 1 in HSD is double the number of data objects at level 1 in LSD.

Heterogeneous Virtual Resource Vulnerability Characterization

In addition to workload characterization with respect to RBAC policy, the VRM also estimates the software vulnerability for a virtual resource—a VM in our case. The estimator uses VMs' security vulnerabilities to qualitatively classify them into multiple classes. The classification is based on the common vulnerability scoring system (CVSS) metric scores. CVSS uses an interval scale of 0–10 to measure vulnerability severity.16 To represent the probability of data leakage, we convert the 0–10 scale to a 0–1 scale. Based on CVSS scores, we assume four discrete classes of VMs—highly secured, medium secured, low secured, and unsecured VMs. Although we select four classes to model the vulnerability with respect to heterogeneous virtual resources, our solution is generalizable to an arbitrary number of heterogeneous classes.

In addition to the probability of leakage within a virtual resource, the VRM needs to consider leakage across virtual resources (VMs) within the same IaaS. The VRM estimates the vulnerability of each IaaS cloud provider. Different remote cloud providers can deploy different security configurations and virtualization software (such as a hypervisor) with varying levels of vulnerabilities. Similar to VM classification, the VRM estimator also classifies the vulnerabilities of remote cloud providers into multiple classes. The vulnerability measurements within VMs are independent from the vulnerability measurements across VMs.

Input: Number of data objects |O|, number of roles n, constant s.

Output: Spectral representation of RBAC, W.

1. Let B = {B1, ..., Bn} be a bucket array;
2. foreach i = 1, ..., |O| do
3.   α = zipf(n, s);
4.   Bα = Bα + 1;
5. foreach i = 1, ..., n do
6.   foreach j = 1, ..., Bi do
7.     α = zipf(C(n, i), s), where C(n, i) is the number of partitions at level i;
8.     map α to a random partition in level i; call it p̂;
9.     wp̂ = wp̂ + 1;
10.    add wp̂ to W;
11. return W

FIGURE 4. Algorithm 1: Workload generation algorithm.
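A possible Python rendering of Algorithm 1, under the assumption (reconstructed from line 7 of the garbled listing) that partition indices within a level are also drawn from the Zipfian distribution; all function names are illustrative:

```python
import math
import random

def zipf_sample(N, s, rng):
    # Draw a rank in 1..N with probability proportional to 1/rank^s.
    weights = [1.0 / i**s for i in range(1, N + 1)]
    return rng.choices(range(1, N + 1), weights=weights, k=1)[0]

def generate_workload(num_objects, n_roles, s, seed=0):
    """Sketch of Algorithm 1.

    Returns W as {lattice level i: {partition index: w_p}}.
    """
    rng = random.Random(seed)
    # Step 1: bucket data objects by lattice level via the Zipfian law.
    buckets = [0] * (n_roles + 1)  # buckets[i] = objects at level i
    for _ in range(num_objects):
        buckets[zipf_sample(n_roles, s, rng)] += 1
    # Step 2: spread each bucket over the C(n, i) partitions at level i.
    W = {}
    for i in range(1, n_roles + 1):
        level = {}
        num_partitions = math.comb(n_roles, i)
        for _ in range(buckets[i]):
            p = zipf_sample(num_partitions, s, rng)
            level[p] = level.get(p, 0) + 1
        W[i] = level
    return W

W = generate_workload(1000, n_roles=4, s=1.5)
# Every data object is accounted for somewhere in the lattice.
print(sum(sum(level.values()) for level in W.values()))  # 1000
```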

FIGURE 3. A statistical characterization of sensitivity of cloud datacenters. [Histogram of the number of data objects at each spectrum lattice level (1–30) for low sensitive (s = 1.1), medium sensitive (s = 1.5), and high sensitive (s = 2.0) workloads.]


For example, a highly secure cloud provider can host unsecure VMs. We assume that the probability of leakage between two VMs belonging to different remote clouds is negligible. Subsequently, we convert these qualitative measures into probability of data leakage. We assume that the probability of leakage within any VM is higher than the probability of leakage across any two VMs. This is because the size of the trusted code base across VMs is generally smaller than commercial operating systems used in a given VM. The trusted code base represents the software stack shared in the multitenant environment, whereas within a VM, the shared software stack (for example, the operating system and middleware) is larger than shared software across VMs (for example, the hypervisor).

Definition 3 (the cloud virtual resource model): Given VM1, VM2, ..., VMm as the suite of m virtual machines available to a VRM, let di,j represent the probability of data leakage between VMi and VMj, estimated using the vulnerability estimation component, as Figure 1 shows. Accordingly, the cloud virtual resource can be modeled as a fully connected undirected graph H(V, E), where vertices V represent the set of VMs and the weights on the edges represent di,j.
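Definition 3's matrix D might be assembled as below, reusing the within-VM and cross-VM leakage probabilities assumed later in the evaluation section; the class names and the builder function are illustrative:

```python
# Within-VM leakage d_ii by VM security class, and cross-VM leakage
# d_ij by IaaS provider class (values from the evaluation section).
WITHIN_VM = {"high": 0.2, "medium": 0.45, "low": 0.6, "unsecured": 0.8}
ACROSS_VM = {"high": 0.001, "moderate": 0.045, "least": 0.1}

def build_leakage_matrix(vm_classes, iaas_class):
    """Adjacency matrix D of the fully connected graph H(V, E):
    D[i][i] is the leakage probability within VM i; D[i][j] is the
    leakage probability between VMs i and j on the same provider."""
    m = len(vm_classes)
    cross = ACROSS_VM[iaas_class]
    return [[WITHIN_VM[vm_classes[i]] if i == j else cross
             for j in range(m)] for i in range(m)]

D = build_leakage_matrix(["high", "medium", "unsecured"], "moderate")
print(D[0][0], D[2][2], D[0][1])  # 0.2 0.8 0.045
```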

Risk-Aware Resource Assignment

Based on the spectral model of a given RBAC policy and the aforementioned virtual resource vulnerability model, we formally define the risk-aware assignment problem (RAP).

Definition 4: Given the spectral representation W of the RBAC policy and the adjacency matrix (D) representing the probabilities of leakage among virtual resources, based on H(V, E), the RAP is to minimize the total risk of data leakage by assigning access control roles to the virtual resource.

The cost function of total risk is defined as

$$\min R_t = \sum_{i=1}^{n} \sum_{w_p \in W} w_p \times \mathit{Threat}(p, i) \times \max_{1 \le l, q \le m,\ \forall j \in p} \left\{ d_{l,q} \times I_{iq} \times I_{jl} \right\} \qquad (3)$$

where

$$p \in P(R), \qquad \sum_{q=1}^{m} I_{iq} = 1 \quad \forall i \in R,$$

$$I_{iq} = \begin{cases} 1 & \text{if role } i \text{ is assigned to VM } q \\ 0 & \text{otherwise,} \end{cases} \qquad \mathit{Threat}(p, i) = \begin{cases} 1 & \text{if } r_i \notin p \\ 0 & \text{otherwise.} \end{cases}$$
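For a fixed assignment, the cost function can be evaluated as sketched below; the dictionary-based encodings of W and of the indicator matrix I are an illustrative choice:

```python
def total_risk(W, assign, D):
    """Evaluate the RAP cost function for a fixed assignment.

    W:      {frozenset of roles: w_p}, the spectral model.
    assign: {role: VM index}, a flat encoding of the matrix I.
    D:      leakage-probability matrix among VMs.
    """
    risk = 0.0
    for role, vm in assign.items():
        for p, w_p in W.items():
            if role in p:  # Threat(p, i) = 0 when r_i has access to p
                continue
            # Worst leakage channel between r_i's VM and any VM hosting p.
            risk += w_p * max(D[vm][assign[j]] for j in p)
    return risk

W = {frozenset({"r1"}): 10, frozenset({"r1", "r2"}): 4}
D = [[0.5, 0.01], [0.01, 0.5]]
# Co-located: r2 threatens r1's unshared partition through d_00 = 0.5.
print(total_risk(W, {"r1": 0, "r2": 0}, D))  # 5.0
# Separated: r2 reaches that partition only across VMs (d_01 = 0.01).
print(total_risk(W, {"r1": 0, "r2": 1}, D))  # 0.1
```

Note how the assignment alone changes the risk from 5.0 to 0.1, which is exactly the degree of freedom the heuristics below exploit.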

Theorem 1: The RAP problem is NP-complete.12

Assignment Heuristics

We propose two heuristics for solving RAP, which the resource assignment component in the VRM (Figure 1b) can deploy. The first, the sharing-based heuristic (SBH), uses a best-fit strategy. In SBH, each role is assigned to the best available VM, in terms of probability of leakage, such that any increase in the total risk is kept to a minimum. SBH has high complexity because it finds the local optimal assignment at each step. Alternatively, we propose a low-complexity scalable heuristic, the partition-based heuristic (PBH), that uses a top-down clustering-based approach. In each step, this heuristic divides the roles based on the highest risk partition.

Because |W| = 2^n − 1, to reduce the complexity of SBH and PBH, we propose an approximation strategy for workload characterization. The strategy is based on considering a smaller percentage of the datacenter's total size. Let such a percentage be denoted as D. In particular, D identifies the cutoff level k in the lattice of P(R), which SBH and PBH can use. We can define such a cutoff as k = min{0 ≤ k ≤ n : H_{k,s} ≥ (D/100) × H_{n,s}}, where H_{k,s} is the kth generalized harmonic number.12

Accordingly, for a given value D of a datacenter, the spectral vector W needs to be truncated using cutoff level k. The truncated lexicographically ordered set, denoted W′, consists of all the partitions of W starting from level 1 up to and including the cutoff level k in the lattice of Figure 2b. In other words, W′ = {wp : (wp ∈ W) ∧ |p| ≤ k}, and |W′| ≤ (n + 1)^k.12

Note that different datacenter sensitivity classes yield different cutoff levels for the same percentage D. Accordingly, the size of W′ varies. The following example illustrates how D and the sensitivity classes can affect the value of k.

Example 3: For W of Example 2, when D = 70 percent, the cutoff (k) is 2 for HSD and 8 for LSD. For D = 95 percent, the cutoff is 18 for MSD, 9 for HSD, and 24 for LSD.

Sharing-Based Heuristic: Best-Fit Approach

Following a best-fit approach, SBH initially selects the role with the most data objects. It assigns this role to the VM with the least probability of leakage. Next, it selects the role that has the highest data sharing with the previously assigned roles and allocates the role to a VM such that any increment in the total risk is kept to a minimum. This step is repeated until all roles are assigned. Notice that the first m role assignments are made to distinct VMs because the probability of leakage across VMs is always less than the probability of leakage within a VM. Therefore, the performance of SBH depends on the initial m role assignments. The remaining n − m roles are co-allocated with the already assigned roles such that we keep the total leakage increase low.

Algorithm 2 (Figure 5) formally represents SBH. In Lines 1–3, SBH assigns an initial role to a VM, storing assigned roles in list A and keeping unassigned roles in list F. Cost matrix C stores the cost of assigning each unassigned role to each VM. In Lines 4–6, each iteration of the outer loop assigns a role from list F with the minimum value in C to a VM and updates the assignment matrix I, which includes the assignment of all partitions so far. The inner loop updates C by removing the last assigned role's entry and updating the entries of C resulting from the new assignment. At the end of its execution, the algorithm returns the final assignment matrix I.

Lemma 1: The complexity of SBH is O(n^2 × m × |W′|).12

Partition-Based Heuristic: A Scalable Approach

PBH uses a scalable top-down clustering approach. Initially, we assume that all roles are in one cluster. We then find the highest attackable partition in W′. A partition's attackability is defined as the size of the partition multiplied by the number of threats for that partition. The number of threats for a partition equals the total number of roles minus the partition's level in the lattice of P(R). Note that as the size and the number of threats increase, the partition becomes more attackable. We begin with division of the root cluster and split it into two clusters. The first cluster contains the roles associated with highly attackable partitions. The remaining roles are stored in the second cluster. By splitting the roles into two clusters based on the highly attackable partition, we eliminate the possibility of co-allocating the roles associated with highly attackable partitions with other roles that pose threats to them. We repeat the last step until the number of clusters equals the number of VMs. Subsequently, we assign the cluster with the highest policy risk to the least vulnerable VM. In essence, this greedy approach of dividing the roles based on the highest attackable partition favors the m top attackable partitions over the others.

Algorithm 3 (Figure 6) formally represents PBH. In Lines 1–2, PBH sorts W′ and saves it in temporary set P. The initial cluster list C has only one cluster, which is the set of all roles. In Lines 4–11, the iteration loops over all the partitions in P and divides a cluster into two new clusters if a partition intersects with any cluster in C. The loop in Lines 4–11 continues until the number of clusters equals the number of VMs or until each cluster has only one role. Line 12 sorts the VM indexes in list L, and Line 13 computes the policy-based risk for each cluster. The policy-based risk is computed according to Equation 1 by setting the vulnerability and threat parameters to 1. Line 14 sorts the clusters, and Lines 15–17 assign the roles of a high-risk cluster to a VM with low probability of leakage. The final assignment is returned by the algorithm through matrix I.

Lemma 2: The complexity of PBH is O(|W′| × log|W′| + n × |W′|).12

Note that because of its subquadratic complexity in terms of the number of roles, PBH is scalable.

Input: A spectral representation of RBAC W′, vulnerability matrix D.

Output: An assignment matrix I of roles to VMs.

1. Find initial VM vmj with minimum dj,j;
2. Find initial role ri with largest attackability;
3. I(ri, vmj) = 1;
4. Let A = {ri} be the set of assigned roles;
5. Let F = R − {ri} be the set of free roles;
6. Let C be the cost matrix with element Ci,j representing the risk of assigning ri to vmj;
7. foreach ri ∈ F do
8.   foreach j = 1, ..., m do
9.     Compute Ci,j;
10.  Let Ck,l be the minimum Ci,j;
11.  I(rk, vml) = 1;
12.  A = A ∪ {rk};
13.  F = F − {rk};
14. return I

FIGURE 5. Algorithm 2: Sharing-based heuristic (SBH).
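A compact, unoptimized sketch of SBH following the listing above; the risk_of and attackability helpers are illustrative implementations of the cost matrix C and of "role with the most data objects," respectively:

```python
def sbh(W, D, roles):
    """Sketch of the sharing-based heuristic (Algorithm 2).

    W: {frozenset of roles: w_p}; D: VM leakage matrix;
    roles: list of role names. Returns {role: VM index}.
    """
    m = len(D)

    def risk_of(assign):
        # Partial-risk evaluation over the roles assigned so far.
        total = 0.0
        for r, vm in assign.items():
            for p, w_p in W.items():
                if r in p:
                    continue
                hosts = [assign[j] for j in p if j in assign]
                if hosts:
                    total += w_p * max(D[vm][h] for h in hosts)
        return total

    def attackability(role):
        # Number of data objects the role has access to (its assets).
        return sum(w for p, w in W.items() if role in p)

    free = sorted(roles, key=attackability, reverse=True)
    first = free.pop(0)
    assign = {first: min(range(m), key=lambda j: D[j][j])}
    while free:
        # Best fit: the (role, VM) pair with the smallest risk increase.
        base = risk_of(assign)
        r, vm, _ = min(
            ((r, j, risk_of({**assign, r: j}) - base)
             for r in free for j in range(m)),
            key=lambda t: t[2])
        assign[r] = vm
        free.remove(r)
    return assign

W = {frozenset({"r1"}): 100, frozenset({"r2"}): 100,
     frozenset({"r1", "r2"}): 1}
D = [[0.5, 0.001], [0.001, 0.5]]
print(sbh(W, D, ["r1", "r2"]))  # the two roles land on distinct VMs
```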

Input: A spectral representation of RBAC W′, vulnerability matrix D.

Output: An assignment matrix I of roles to VMs.

1. Sort W′ from the highest attackable wi to the smallest attackable wj;
2. Let P hold the indices of the sorted W′;
3. Let C = {{r1, r2, ..., rn}} be the initial cluster list;
4. foreach pi ∈ P do
5.   foreach cj ∈ C do
6.     if pi ∩ cj ≠ ∅ then
7.       C = C − cj;
8.       C = C ∪ (pi ∩ cj);
9.       C = C ∪ (cj − pi);
10.  if |C| ≥ m then
11.    break;
12. Let L = {l1, ..., lm} be the VMs sorted based on di,i from smallest to largest, where i ∈ {1, ..., m};
13. Compute the intra risk for each ci ∈ C;
14. Let C be the sorted C based on cluster risk;
15. foreach i ∈ {1, ..., m} do
16.   foreach rk ∈ ci do
17.     I(rk, vmli) = 1;
18. return I;

FIGURE 6. Algorithm 3: Partition-based heuristic (PBH).
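The top-down splitting at the heart of PBH (Lines 4–11) might look as follows; the guard against empty splits, and all names, are illustrative:

```python
def pbh_clusters(sorted_partitions, all_roles, m):
    """Sketch of PBH's splitting phase (Algorithm 3, Lines 3-11):
    repeatedly split clusters on the most attackable partitions until
    there are at least m clusters (or no splits remain possible).

    sorted_partitions: partition role-sets, most attackable first.
    """
    clusters = [set(all_roles)]  # start with every role in one cluster
    for p in sorted_partitions:
        next_clusters = []
        for c in clusters:
            inter = c & p
            if inter and inter != c:
                # Split c into the roles of partition p and the rest,
                # isolating the attackable partition's roles.
                next_clusters.extend([inter, c - inter])
            else:
                next_clusters.append(c)  # no (proper) intersection
        clusters = next_clusters
        if len(clusters) >= m:
            break
    return clusters

# Attackability = partition size x (n - level); most attackable first.
partitions = [frozenset({"r1"}), frozenset({"r2", "r3"}), frozenset({"r4"})]
print(pbh_clusters(partitions, ["r1", "r2", "r3", "r4"], m=3))
```

After this phase, Lines 12–17 would sort the clusters by policy risk and map them onto VMs sorted by di,i.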


Performance Evaluation

We compare the performance of SBH and PBH using two metrics. The data leakage risk is the main metric that needs to be minimized. The risk metric has two submetrics that we use to evaluate the proposed heuristics. We compute the first risk metric from the total workload representing all partitions in W. This risk metric represents the total risk and is denoted as Rt. In other words, Rt represents the potential risk for the whole datacenter. We base the second submetric of risk on the partition of W′ and use it to study the performance of the heuristics and compare their effectiveness. This submetric—partial risk (denoted Rp)—corresponds to the risk resulting from the workload approximation W′. Formally, we write the risk metrics as follows:

$$R_t = \sum_{w_p \in W} \mathit{Risk}(w_p) \qquad (4)$$

$$R_p = \sum_{w_p \in W'} \mathit{Risk}(w_p) \qquad (5)$$

Note that the difference between total risk and partial risk represents the relative risk error introduced by the heuristic. This error is a result of the heuristic's lack of knowledge of all the partitions due to workload approximation. We define this error as

$$E_R = \frac{R_t - R_p}{R_t} = \left(1 - \frac{R_p}{R_t}\right)$$

We evaluate our heuristics and study their performance for different statistical workload approximations of RBAC and data sensitivity classifications. Our workload generation algorithm uses the same statistical model used in YCSB,15 which bases the selectivity of data objects on the Zipfian distribution. We determine datacenter sensitivity using the high, medium, and low settings. We consider different data percentages (D) of a datacenter, varying from 70 to 95 percent of the datacenter's total size. We simulate RBAC policies with 120, 150, and 200 roles. To manage simulation time, we assume a datacenter of 500,000 data objects. However, we can apply the proposed heuristics to any size datacenter.

Furthermore, in our experiment, we implement four classes of VM vulnerability: highly secured (with low probability of leakage di,i = 0.2), medium secured (di,i = 0.45), low secured (di,i = 0.6), and unsecured (with very high probability of leakage di,i = 0.8). Based on an earlier survey,10 we assume that 1 percent of VMs are unsecured, 22 percent are highly secured, 44 percent are medium secured, and 33 percent are low secured. Accordingly, we classify IaaS security into three categories:

• highly secure IaaS cloud providers, where we assume the probability of data leakage among VMs is extremely small (di,j = 0.001);
• moderately secure IaaS cloud providers, where di,j = 0.045; and
• the least secure IaaS cloud providers, where di,j = 0.1.

Table 1 shows the total risk Rt resulting from SBH and PBH assignment for various number of

Table 1. Total risk (Rt) for the sharing-based and partition-based heuristics (SBH and PBH).

No. of roles (n) | No. of VMs (m) | Percentage of the datacenter (D) | Low sensitive: SBH | Low sensitive: PBH | Medium sensitive: SBH | Medium sensitive: PBH | High sensitive: SBH | High sensitive: PBH
120 | 40 | 70 | 3,360,231 | 3,606,591 | 2,445,374 | 2,452,659 | 1,518,554 | 2,139,564
120 | 40 | 80 | 3,291,886 | 3,522,308 | 2,244,157 | 2,401,973 | 1,465,960 | 1,512,723
120 | 40 | 90 | 3,215,064 | 3,470,285 | 2,154,764 | 2,309,507 | 1,231,830 | 1,405,526
120 | 40 | 95 | 3,344,730 | 3,470,285 | 2,217,724 | 2,302,854 | 1,224,026 | 1,406,141
150 | 50 | 70 | 4,121,669 | 4,225,904 | 2,932,017 | 3,263,288 | 1,767,732 | 2,658,996
150 | 50 | 80 | 4,078,425 | 4,196,071 | 2,785,004 | 2,828,237 | 1,756,312 | 2,234,971
150 | 50 | 90 | 4,068,345 | 4,196,071 | 2,663,637 | 2,782,716 | 1,558,556 | 1,761,015
150 | 50 | 95 | 4,038,736 | 4,201,984 | 2,647,521 | 2,782,716 | 1,468,684 | 1,692,846
200 | 70 | 70 | 5,685,296 | 5,402,101 | 3,865,387 | 4,362,990 | 2,228,901 | 3,627,837
200 | 70 | 80 | 5,562,226 | 5,352,938 | 3,686,288 | 3,613,758 | 1,973,325 | 2,837,346
200 | 70 | 90 | 5,580,674 | 5,351,282 | 3,356,819 | 3,446,952 | 1,692,089 | 2,355,639
200 | 70 | 95 | 5,475,651 | 5,350,779 | 3,355,768 | 3,446,952 | 1,754,023 | 1,970,093


roles (n) and number of VMs (m), and various sensitivity and data percentage settings. The Rt of HSD is higher than the total risk for MSD and LSD. This is because most data objects in HSD are highly attackable, which increases the number of potential threats compared to other types of datacenters. In addition, Rt decreases as the value of D increases from 70 to 90 percent. This decrease in Rt occurs because for large values of D, the VRM has more knowledge about W. In other words, as (W − W′) gets smaller, the uncertainty in assignment decision decreases. However, in some cases, Rt increases when the data percentage increases from 90 to 95 percent—for example, the low sensitivity column for SBH and PBH. The reason is that both heuristics try to minimize the partial risk Rp associated with W′, and in some cases such minimization can result in an increase of Rt. In addition, we notice that varying n and m doesn't change the heuristics' behavior. However, as per Equation 3, the total risk Rt increases with n.

Figures 7a and 7b give the performance of SBH and PBH in terms of partial risk Rp for a datacenter's different sensitivity levels (high, medium, low) while varying the percentage D of the datacenter's overall size. As the figures show, SBH outperforms PBH slightly for n = 150 and n = 200, but at a higher

[Figure 7: eight panels plotting risk and relative error against the percentage of the datacenter (70, 80, 90, and 95 percent) for high-, medium-, and low-sensitivity settings, for (n, m) = (120, 40), (150, 50), and (200, 70), with |T| = 500,000.]

FIGURE 7. SBH and PBH performance (risk and relative error): (a) Rp for SBH and PBH for n = 150, (b) Rp for SBH and PBH for n = 200, (c) SBH relative error, (d) PBH relative error, (e) SBH relative error, (f) PBH relative error, (g) SBH relative error, and (h) PBH relative error.


44 IEEE CLOUD COMPUTING WWW.COMPUTER.ORG/CLOUDCOMPUTING

SECURE BIG DATA IN THE CLOUD

computation complexity. Figures 7c–7h compare the performance of SBH and PBH in terms of their relative error E. The general behavior is that E decreases as D increases, because the cutoff level k increases proportionally with D, resulting in a reduction of E. Further, E is higher for HSD than for LSD. This is because the cutoff level k for HSD is small, so the partitions that aren't considered in the assignment decision are highly attackable. Consequently, these partitions impose higher risk than in LSD, and the total risk increases, as does E. For PBH, E is high for HSD when D = 70 percent. This is because few of the partitions are highly attackable, that is, those at a low level of the spectral model. Accordingly, PBH might not be an attractive choice for HSD when considering low values of D. However, as Table 1 shows, the relative error for PBH is within 10 percent of the error produced by SBH for both the LSD and MSD cases. As a scalable algorithm, PBH offers a viable choice for these cases.

With the growing use of software as a service (SaaS) for cloud datacenters, the complex interplay of software and virtual machines exacerbates the security challenges addressed in this article. Our future work will consider the impact of services on the risk of data leakage resulting from joint vulnerabilities of services and virtual machines.

Acknowledgments

This work was supported by US National Science Foundation grant IIS-0964639.

References

1. G.-H. Kim, S. Trimi, and J.-H. Chung, "Big-Data Applications in the Government Sector," Comm. ACM, vol. 57, no. 3, 2014, pp. 78–85.

2. J. Alcaraz Calero et al., "Toward a Multi-Tenancy Authorization System for Cloud Services," IEEE Security & Privacy, vol. 8, no. 6, 2010, pp. 48–55.

3. T. Ristenpart et al., "Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds," Proc. 16th ACM Conf. Computer and Comm. Security, 2009, pp. 199–212.

4. M. Pearce, S. Zeadally, and R. Hunt, "Virtualization: Issues, Security Threats, and Solutions," ACM Computing Surveys, vol. 45, no. 2, 2013, p. 17.

5. L. Catuogno et al., "Trusted Virtual Domains: Design, Implementation and Lessons Learned," Trusted Systems, Springer, 2010, pp. 156–179.

6. U. Steinberg and B. Kauer, "Nova: A Microhypervisor-Based Secure Virtualization Architecture," Proc. 5th European Conf. Computer Systems, 2010, pp. 209–222.

7. S. Berger et al., "Security for the Cloud Infrastructure: Trusted Virtual Data Center Implementation," IBM J. Research and Development, vol. 53, no. 4, 2009, pp. 560–1.

8. D.F. Ferraiolo et al., "Proposed NIST Standard for Role-Based Access Control," ACM Trans. Information and System Security, vol. 4, no. 3, 2001, pp. 224–274.

9. A. Almutairi et al., "A Distributed Access Control Architecture for Cloud Computing," IEEE Software, vol. 29, no. 2, 2012, pp. 36–44.

10. M. Balduzzi et al., "A Security Analysis of Amazon's Elastic Compute Cloud Service," Proc. 27th Ann. ACM Symp. Applied Computing, 2012, pp. 1427–1434.

11. ISO/IEC Std. 27005, Information Security Risk Management, ISO, 2011; https://www.iso.org/obp/ui/#iso:std:iso-iec:27005:edu-2:v1:en.

12. A. Almutairi and A. Ghafoor, Risk-Aware Virtual Resource Management for Access Control-Based Cloud Datacenters, CERIAS tech. report, Purdue Univ., 2014.

13. H. Zhang, I.F. Ilyas, and K. Salem, "Psalm: Cardinality Estimation in the Presence of Fine-Grained Access Controls," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE 09), 2009, pp. 505–516.

14. M. Frank et al., "Multi-Assignment Clustering for Boolean Data," J. Machine Learning Research, vol. 13, no. 1, 2012, pp. 459–489.

15. B.F. Cooper et al., "Benchmarking Cloud Serving Systems with YCSB," Proc. 1st ACM Symp. Cloud Computing, 2010, pp. 143–154.

16. P. Mell, K. Scarfone, and S. Romanosky, "A Complete Guide to the Common Vulnerability Scoring System Version 2.0," FIRST–Forum of Incident Response and Security Teams, 2007, pp. 1–23.

ABDULRAHMAN ALMUTAIRI is a PhD student in the School of Electrical and Computer Engineering at Purdue University. His research interests include information security and privacy and cloud computing systems. Almutairi received an MS in electrical and computer engineering from Purdue. He is a student member of IEEE. Contact him at [email protected].

ARIF GHAFOOR is a professor in the School of Electrical and Computer Engineering at Purdue University. His research interests include information security and distributed multimedia systems. Ghafoor received a PhD in electrical engineering from Columbia University. He is a fellow of IEEE. Contact him at [email protected].


PURPOSE: The IEEE Computer Society is the world’s largest

association of computing professionals and is the leading

provider of technical information in the field.

MEMBERSHIP: Members receive the monthly magazine

Computer, discounts, and opportunities to serve (all activities

are led by volunteer members). Membership is open to all IEEE

members, affiliate society members, and others interested in the

computer field.

COMPUTER SOCIETY WEBSITE: www.computer.org

OMBUDSMAN: To check membership status or report a change of

address, call the IEEE Member Services toll-free number,

+1 800 678 4333 (US) or +1 732 981 0060 (international). Direct

all other Computer Society-related questions—magazine delivery

or unresolved complaints—to [email protected].

CHAPTERS: Regular and student chapters worldwide provide the

opportunity to interact with colleagues, hear technical experts,

and serve the local professional community.

AVAILABLE INFORMATION: To obtain more information on any

of the following, contact Customer Service at +1 714 821 8380 or

+1 800 272 6657:

• Membership applications

• Publications catalog

• Draft standards and order forms

• Technical committee list

• Technical committee application

• Chapter start-up procedures

• Student scholarship information

• Volunteer leaders/staff directory

• IEEE senior member grade application (requires 10 years

practice and significant performance in five of those 10)

PUBLICATIONS AND ACTIVITIES

Computer: The flagship publication of the IEEE Computer

Society, Computer, publishes peer-reviewed technical content that

covers all aspects of computer science, computer engineering,

technology, and applications.

Periodicals: The society publishes 13 magazines, 16 transactions,

and one letters journal. Refer to membership application or request

information as noted above.

Conference Proceedings & Books: Conference Publishing

Services publishes more than 175 titles every year.

Standards Working Groups: More than 150 groups produce IEEE

standards used throughout the world.

Technical Committees: TCs provide professional interaction in

more than 45 technical areas and directly influence computer

engineering conferences and publications.

Conferences/Education: The society holds about 200 conferences

each year and sponsors many educational activities, including

computing science accreditation.

Certifications: The society offers two software developer

credentials. For more information, visit www.computer.org/certification.

NEXT BOARD MEETING

26–30 January 2015, Long Beach, CA, USA

EXECUTIVE COMMITTEE
President: Dejan S. Milojicic

President-Elect: Thomas M. Conte

Past President: David Alan Grier

Secretary: David S. Ebert

Treasurer: Charlene (“Chuck”) J. Walrad

VP, Educational Activities: Phillip Laplante

VP, Member & Geographic Activities: Elizabeth L. Burd

VP, Publications: Jean-Luc Gaudiot

VP, Professional Activities: Donald F. Shafer

VP, Standards Activities: James W. Moore

VP, Technical & Conference Activities: Cecilia Metra

2014 IEEE Director & Delegate Division VIII: Roger U. Fujii

2014 IEEE Director & Delegate Division V: Susan K. (Kathy) Land

2014 IEEE Director-Elect & Delegate Division VIII: John W. Walz

BOARD OF GOVERNORS
Term Expiring 2014: Jose Ignacio Castillo Velazquez, David S. Ebert,

Hakan Erdogmus, Gargi Keeni, Fabrizio Lombardi, Hironori Kasahara,

Arnold N. Pears

Term Expiring 2015: Ann DeMarle, Cecilia Metra, Nita Patel, Diomidis

Spinellis, Phillip Laplante, Jean-Luc Gaudiot, Stefano Zanero

Term Expiring 2016: David A. Bader, Pierre Bourque, Dennis Frailey, Jill

I. Gostin, Atsuhiro Goto, Rob Reilly, Christina M. Schober

EXECUTIVE STAFF
Executive Director: Angela R. Burgess

Associate Executive Director & Director, Governance: Anne Marie Kelly

Director, Finance & Accounting: John Miller

Director, Information Technology & Services: Ray Kahn

Director, Membership Development: Eric Berkowitz

Director, Products & Services: Evan Butterfield

Director, Sales & Marketing: Chris Jensen

COMPUTER SOCIETY OFFICES
Washington, D.C.: 2001 L St., Ste. 700, Washington, D.C. 20036-4928

Phone: +1 202 371 0101 • Fax: +1 202 728 9614

Email: [email protected]

Los Alamitos: 10662 Los Vaqueros Circle, Los Alamitos, CA 90720

Phone: +1 714 821 8380

Email: [email protected]

MEMBERSHIP & PUBLICATION ORDERS

Phone: +1 800 272 6657 • Fax: +1 714 821 4641 • Email: [email protected]

Asia/Pacific: Watanabe Building, 1-4-2 Minami-Aoyama, Minato-ku,

Tokyo 107-0062, Japan

Phone: +81 3 3408 3118 • Fax: +81 3 3408 3553

Email: [email protected]

IEEE BOARD OF DIRECTORS
President: J. Roberto de Marca

President-Elect: Howard E. Michel

Past President: Peter W. Staecker

Secretary: Marko Delimar

Treasurer: John T. Barr

Director & President, IEEE-USA: Gary L. Blank

Director & President, Standards Association: Karen Bartleson

Director & VP, Educational Activities: Saurabh Sinha

Director & VP, Membership and Geographic Activities: Ralph M. Ford

Director & VP, Publication Services and Products: Gianluca Setti

Director & VP, Technical Activities: Jacek M. Zurada

Director & Delegate Division V: Susan K. (Kathy) Land

Director & Delegate Division VIII: Roger U. Fujii

revised 6 Nov. 2014



46 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE

Efficient and Secure Transfer, Synchronization, and Sharing of Big Data

Kyle Chard, Steven Tuecke, and Ian Foster, University of Chicago and Argonne National Laboratory

Globus supports standard data interfaces and common security models for securely accessing, transferring, synchronizing, and sharing large quantities of data.

Cloud computing's unprecedented adoption by commercial and scientific communities is due in part to its elastic computing capability, pay-as-you-go usage model, and inherent scalability. Cloud platforms are proving to be viable alternatives to in-house resources for scholarly applications, with researchers in areas spanning the physical and natural sciences through the arts regularly using them.1 As we enter the era of big data and data-driven research (the "fourth paradigm of science"2), researchers face challenges related to hosting, organizing, transferring, sharing, and analyzing large quantities of data. Many believe that cloud models provide an ideal platform for supporting big data.


Large scientific datasets are increasingly hosted on both public and private clouds. For example, public datasets hosted by Amazon Web Services (AWS) include 20 Tbytes of NASA Earth science data, 500 Tbytes of Web-crawled data, and 200 Tbytes of genomic data from the 1000 Genomes project. Open clouds such as the Open Science Data Cloud (OSDC)3 host many of the same research datasets in their collection of more than 1 Pbyte of open data. Thus, it's frequently convenient, efficient, and cost-effective to work with these datasets on the cloud. In addition to these high-profile public datasets, many researchers store and work with large datasets distributed across a plethora of cloud and local storage systems. For example, researchers might use datasets stored in object stores such as Amazon Simple Storage Service (S3), large mountable block stores such as Amazon Elastic Block Store (EBS), instance storage attached to running cloud virtual machine (VM) instances, and other data stored on their institutional clusters, personal computers, and in supercomputing centers.

Given the distribution and diversity of storage as well as increasingly huge data sizes, we need standardized, secure, and efficient methods to access data, move it to other systems for analysis, synchronize changing datasets across systems without copying the entire dataset, and share data with collaborators and others for extension and verification. Although high-performance methods are clearly required as data sizes grow, secure methods are equally important, given that these datasets might include medical, personal, financial, government, and intellectual property data. Thus, we need models that provide a standard interface through which users can perform these actions, and methods that leverage proven security models to provide a common interface and single sign-on. These approaches must also be easy to use, scalable, efficient, and independent of storage type.

Globus is a hosted provider of high-performance, reliable, and secure data transfer, synchronization, and sharing.4 In essence, it establishes a huge distributed data cloud through a vast network of

[Figure 1: a diagram of Globus linking supercomputers and campus clusters, personal resources, and object, block/drive, and instance storage via Globus Connect; Globus Nexus mediates authentication through InCommon/CILogon, MyProxy OAuth, and OpenID; arrows denote access, transfer, synchronize, and share operations.]

FIGURE 1. Globus provides transfer, synchronization, and sharing of data across a wide variety of storage resources. Globus Nexus provides a security layer through which users can authenticate using a number of linked identities. Globus Connect provides a standard API for accessing storage resources.



Globus-accessible endpoints: storage resources that implement Globus's data access APIs. Through this cloud, users can access, move, and share large amounts of data remotely, without worrying about performance, reliability, or data integrity.

Globus: Large-Scale Research Data Management as a Service

Figure 1 gives a high-level view of the Globus ecosystem. Core Globus capabilities are split into two services: Globus Nexus manages user identities and groups,5 whereas the Globus transfer service manages transfer, synchronization, and sharing tasks on the user's behalf.6 Both services offer programmatic APIs and clients to access their functionality remotely. They're also accessible via the Globus Web interface (www.globus.org).

Globus Nexus provides the high-level security fabric that supports authentication and authorization. Its identity management function lets users create and manage a Globus identity; users can create a profile associated with their identity, which they can then use to make authorization decisions. It also acts as an identity hub, where users can link external identities to their Globus identity. Users can authenticate with Globus through these linked external identities using a single sign-on model. Supported identities include campus identities using InCommon/CILogon via OAuth, Google accounts via OpenID, XSEDE accounts via MyProxy OAuth, an Interoperable Global Trust Federation (IGTF)-certified X.509 certificate authority, and Secure Socket Shell (SSH) key pairs. To support collective authorization decisions (such as when sharing data with collaborators), Globus Nexus also supports the creation and management of user-defined groups.
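The hub-of-linked-identities model can be pictured as a small data structure: one Globus identity, a set of linked external identities that each authenticate the same user, and group memberships that drive collective authorization. The following toy sketch illustrates the idea only; the identity names and the group rule are invented, and the real Nexus service is far richer.

```python
from dataclasses import dataclass, field

@dataclass
class GlobusIdentity:
    """One Globus identity acting as a hub for linked external identities."""
    username: str
    linked: set = field(default_factory=set)   # e.g. campus, Google, XSEDE ids
    groups: set = field(default_factory=set)   # user-defined groups

    def link(self, external_id):
        """Attach an external identity to this Globus identity."""
        self.linked.add(external_id)

    def can_authenticate_as(self, external_id):
        """Single sign-on: any linked identity authenticates the Globus user."""
        return external_id in self.linked

def may_access(identity, shared_with_groups):
    """Collective authorization: grant access if the user belongs to any
    group the resource was shared with."""
    return bool(identity.groups & shared_with_groups)

# Hypothetical user with two linked identities and one group membership.
alice = GlobusIdentity("alice", groups={"climate-lab"})
alice.link("alice@university.edu")   # hypothetical campus identity
alice.link("alice@gmail.com")        # hypothetical Google identity

print(alice.can_authenticate_as("alice@university.edu"))  # True
print(may_access(alice, {"climate-lab", "physics"}))      # True
```

The design point this captures is that authorization attaches to the hub identity and its groups, not to any one external credential, so linking or revoking a campus or Google login never changes what the user may access.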

The Globus transfer service provides core data management capabilities and implements an associated data access security fabric. Globus uses the GridFTP protocol7 to transfer data between logical endpoints, a Globus representation of an accessible GridFTP server. GridFTP extends FTP to improve performance, enable third-party transfers, and support enhanced security models. The basic Globus model for accessing and moving data requires deploying a GridFTP server on a computer and registering a corresponding logical endpoint in Globus. The GridFTP server must be configured with an authentication provider that handles the mapping of credentials to user accounts. Often, authentication is provided by a colocated MyProxy credential management system,8 which lets users obtain short-term X.509 certificate-based proxy credentials by authenticating with a plug-in authentication module (for example, local user accounts, Lightweight Directory Access Protocol [LDAP], or InCommon/CILogon).

Globus uses two separate communication channels. The control channel is established between Globus and the endpoint to start and manage transfers, retrieve directory listings, and establish the data channel. The data channel is established directly between two Globus endpoints (GridFTP servers) and is used for data flowing between systems. The data channel is inaccessible to the Globus service, so no user data passes through Globus itself.
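On the control channel, submitting a transfer amounts to sending the service a task document describing the source, destination, and items to move. A minimal sketch of assembling such a document in Python follows; the field names are modeled on the publicly documented Globus transfer REST API of this era, but the endpoint names and submission ID below are hypothetical, and no network call is made.

```python
import json

def build_transfer_task(submission_id, source_endpoint, dest_endpoint,
                        items, sync_level=None, verify_checksum=True):
    """Assemble a transfer task document in the style of the Globus
    transfer service's REST API (a sketch, not the authoritative schema)."""
    task = {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,   # obtained from the service beforehand
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        # Ask the service to compare checksums once the transfer completes.
        "verify_checksum": verify_checksum,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": recursive,   # True for directories
            }
            for src, dst, recursive in items
        ],
    }
    if sync_level is not None:
        task["sync_level"] = sync_level   # stricter levels recheck more state
    return task

# Example: mirror a dataset directory between two hypothetical endpoints.
doc = build_transfer_task(
    "aaf3f4e8-0000-0000-0000-000000000000",
    "myuni#cluster", "myuni#laptop",
    [("/data/genomes/", "/home/alice/genomes/", True)],
    sync_level=3,
)
print(json.dumps(doc, indent=2))
```

In the real service this JSON would be POSTed over the authenticated control channel; the bytes themselves then flow endpoint-to-endpoint over the data channel, never through the document above.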

Several capabilities differentiate Globus from its competitors:

• High performance. Globus tunes performance based on heuristics to maximize throughput using techniques such as pipelining and parallel data streams.

• Reliable. Globus manages every stage of data transfer, periodically checks transfer performance, recovers from errors by retrying transfers, and notifies users of various events (such as errors and success). At the conclusion of a transfer, Globus compares checksums to ensure data integrity.

• Secure. Globus implements best-practice security approaches with respect to user authentication and authorization, securely manages the storage and transmission of credentials to endpoints for authentication, and supports optional data encryption.

• Third-party transfer. Unlike most transfer mechanisms (such as SCP [secure copy]), Globus facilitates third-party transfers between two remote endpoints. That is, rather than maintain a persistent connection to an endpoint, users can start a transfer and then let Globus manage it for its duration.

• High availability. Globus is hosted using a distributed, replicated, and redundant hosting model deployed across several AWS availability zones. In the past year, Globus and its constituent services have achieved 99.96 percent availability.

• Accessible. Because Globus is a software-as-a-service (SaaS) provider, users can access its capabilities without installing client software locally, so they can start and manage transfers through their Web browsers, or using the Globus command-line interface or REST API.
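The end-to-end integrity check described in the second bullet can be approximated locally: after a copy, hash both replicas and compare the digests. The sketch below uses SHA-256 and a local file copy purely for illustration; the article doesn't specify which checksum algorithm Globus itself uses.

```python
import hashlib
import os
import shutil
import tempfile

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file incrementally so arbitrarily large files fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verified_copy(src, dst):
    """Copy src to dst, then confirm both replicas hash identically."""
    shutil.copyfile(src, dst)
    return file_checksum(src) == file_checksum(dst)

# Demonstration with a small temporary file of random bytes.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "a.bin")
dst = os.path.join(tmp, "b.bin")
with open(src, "wb") as f:
    f.write(os.urandom(4096))
print(verified_copy(src, dst))  # True when the copy is intact
```

Hashing in fixed-size chunks is what makes this viable at the multi-terabyte transfer sizes discussed later: memory use stays constant regardless of file size.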

In three and a half years of operation, Globus has attracted more than 18,000 registered users, of which approximately 200 to 250 are active every


day, and has conducted nearly 1 million transfers, collectively containing more than 2 billion files and 52 Pbytes of data. Figure 2 summarizes the Globus transfers over this period. The graphs include only transfer tasks (that is, they don't include mkdir, delete, and so on) in which data has been transferred (for example, they don't include sync jobs that don't transfer files) between nontesting endpoints (that is, they ignore the Globus test endpoints go#ep1 and go#ep2). Figure 2a shows the frequency of the total number of bytes transferred in a single transfer task (note the log bins), and Figure 2b shows the frequency of the total number of files and directories transferred in a single transfer task. As Figure 2a shows, the most common transfers are between 100 Mbytes and 1 Gbyte (81,624 total transfers), whereas more than 700 transfers have moved tens of Tbytes of data and 39 have moved hundreds of Tbytes (max 500.415 Tbytes). The most common number of files and directories transferred is less than 10; however, more than 400 transfers have moved more than 1 million files each (max 39,643,018), and 120 transfers have moved more than 100,000 directories (max 7,675,096). Figure 2 highlights the huge scale at which Globus operates in terms of data sizes transferred, number of files and directories moved, and number of transfers conducted.
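The logarithmic binning behind Figure 2 is easy to reproduce: a task that moves s bytes falls in decade bin floor(log10 s), so 100 Mbytes–1 Gbyte is bin 8, and so on. A short sketch, with an invented sample of per-task sizes for illustration:

```python
import math
from collections import Counter

def log10_bin(size_bytes):
    """Return the decade a transfer size falls into: 0 for 1-9 bytes,
    1 for 10-99 bytes, ..., 8 for 100 Mbytes-1 Gbyte, and so on."""
    return int(math.floor(math.log10(size_bytes)))

def bin_label(b):
    """Human-readable range for a decade bin, e.g. '1e8-1e9 bytes'."""
    return f"1e{b}-1e{b + 1} bytes"

# Invented sample of per-task transfer sizes (bytes), for illustration only.
sizes = [3_500, 250_000_000, 900_000_000, 42, 7_000_000_000, 120_000_000]

histogram = Counter(log10_bin(s) for s in sizes)
for b in sorted(histogram):
    print(bin_label(b), histogram[b])
```

Binning by decade rather than by fixed width is what lets a single histogram span the nine orders of magnitude (bytes to hundreds of terabytes) that the real transfer data covers.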

Extending the Globus Data Cloud

Globus currently supports a network of more than 8,000 active (used within the last year) endpoints distributed across the world and hosted at a variety of locations, from PCs to supercomputers. Users can already access and transfer data from many locations via Globus: supercomputing centers such as the National Center for Supercomputing Applications (NCSA) and the San Diego Supercomputer Center (SDSC); university research computing centers such as those at the University of Chicago; cloud platforms such as Amazon Web Services and the Open Science Data Cloud (OSDC); large user facilities such as CERN and Argonne National Laboratory's Advanced Photon Source; and commercial data providers such as PerkinElmer. This vast collection of accessible endpoints ensures that new Globus users have access to large quantities of data immediately.

As new users join Globus, they often require access to new storage resources (including their own PCs). Thus, an important goal is to provide trivial methods for making resources accessible via Globus. To allow data access via Globus, storage systems must be configured with a GridFTP server and some authentication method. To ease this process, we developed Globus Connect, a software package that can be deployed quickly and easily to make resources accessible to Globus. We developed two versions of Globus Connect for different deployment scenarios.

Globus Connect Personal is a lightweight single-user agent that operates in the background much like other SaaS agents (such as Google Drive and Dropbox). A unique key is created for each installation and is used to peer Globus Connect to the user's Globus account, ensuring that the endpoint is only accessible to its owner. Because we designed Globus Connect Personal for installation on PCs, it supports operation on networks behind firewalls and network address translation (NAT) through its use of outbound connections and relay servers (similar to other user agents such as Skype). Because it can run in user space, it doesn't require administrator privileges. Globus Connect Personal is available for Linux, Windows, and MacOS.

Globus Connect Server is a multiuser server installation that supports advanced configuration

[Figure 2: two histograms; panel (a) bins transfer tasks by total data moved, from 1 byte through 100 Tbytes–1 Pbyte, and panel (b) bins tasks by the number of files and directories moved, from 10^1 through 10^9–10^10.]

FIGURE 2. Frequency of transfers with given transfer size and number of files and directories. Transfer task frequency for (a) total transfer size, and (b) number of files and directories.



options. It includes a full GridFTP server and an optional colocated MyProxy server for authentication. Alternatively, users can configure existing authentication sources upon installation. The installation process requires a one-command setup and completion of a configuration file that defines aspects such as the endpoint name, file system restrictions, network interface, and authentication method. Globus Connect Server also supports multiserver data transfer node configurations to provide increased throughput. Globus Connect Server is available as native Debian and RedHat packages.

With Globus Connect, users can quickly expose any type of storage resource to the Globus cloud. They can use lightweight Globus Connect Personal endpoints on PCs and even short-lived cloud instances. They can even script the download and configuration of these endpoints for programmatic execution. For more frequently used resources with multiple users (such as data transfer nodes, clusters, storage providers, and long-term and high-performance storage such as the High Performance Storage System [HPSS]), they can deploy Globus Connect Server and leverage institutional identity providers. They can then scale deployments over time by adding Globus Connect Server nodes to load-balance transfers. Both versions support all Globus features, including access, transfer, synchronization, and sharing.

Supporting Cloud Object Stores

To allow users to access a variety of cloud storage systems, Globus supports the creation of endpoints directly on Amazon S3 object storage. Users can thus access, transfer, and share data between S3 and existing Globus endpoints as they do between any other Globus endpoints. To access S3, users must create an S3-backed endpoint that maps to a specific S3 bucket to which they have access. With this model, users can expose access to large datasets stored in S3 and benefit from Globus's advanced features, including high-performance and reliable transfer, rather than relying on standard HTTP support (which doesn't scale to large datasets and doesn't ensure data integrity). Users can also leverage Globus's synchronization and sharing capabilities directly from S3 endpoints.

Globus S3 endpoints support transfers directly from existing endpoints, so they don't require data staging via a Globus Connect deployment hosted on Amazon's cloud. This approach differs from GreenButton WarpDrive (www.greenbutton.com), which, although it also uses GridFTP, relies on a pool of GridFTP servers hosted on cloud instances. Globus's S3 support builds upon extensions to GridFTP that support communication directly between S3 and GridFTP servers. Globus enables user-controlled registration of logical S3 endpoints, requiring only details identifying the storage location (that is, the S3 bucket) and the information required to connect to the S3 endpoint. To provide secure access to data stored in S3, while also enabling user-controlled sharing via Globus, we leverage Amazon's Identity and Access Management (IAM) service to delegate control of an S3 bucket to a trusted Globus user. We peer this Globus IAM user with the Globus transfer service via trusted credentials. Thus, when delegating access to an S3 bucket, Globus can base authorization decisions on internal policies (such as sharing permissions) to allow transfers between other Globus endpoints and the S3 endpoint.
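The delegation step amounts to attaching a bucket policy that names the trusted IAM user as principal. The sketch below assembles such a policy as a Python dictionary using the standard IAM policy language; the account ID, user name, bucket name, and the exact set of actions granted are illustrative assumptions, not the actual policy Globus uses.

```python
import json

def bucket_delegation_policy(bucket, trusted_user_arn):
    """Build an S3 bucket policy granting a single trusted IAM user
    list/read/write access, so a mediating service can enforce sharing."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow listing the bucket's contents.
                "Effect": "Allow",
                "Principal": {"AWS": trusted_user_arn},
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {   # Allow reading and writing the objects themselves.
                "Effect": "Allow",
                "Principal": {"AWS": trusted_user_arn},
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }

# Hypothetical account, user, and bucket names.
policy = bucket_delegation_policy(
    "my-research-data",
    "arn:aws:iam::123456789012:user/globus-transfer")
print(json.dumps(policy, indent=2))
```

The point of the pattern is that the bucket owner grants exactly one principal, and all finer-grained sharing decisions are then made by the mediating service's own policies rather than by further IAM changes.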

Providing Scalable In-Place Data Sharing

One of the most common requirements associated with big data (and scientific data in general) is the ability to share data with collaborators. Current models for data sharing are limited in many ways, especially as data sizes increase. For example, cloud-based mechanisms such as Dropbox require that users first move (replicate) their data to the cloud, which is both costly and time consuming. Ad hoc models, such as sharing directly from institutional storage, require manual configuration, creation, and management of remote user accounts, making them difficult to manage and audit. These difficulties become insurmountable when data is large and dynamic sharing changes are required. Rather than implement yet another storage service, we focus on enabling in-place data sharing. That is, shared data does not reside on Globus; rather, Globus lets users control who can access their data directly on their existing endpoints.

To share data in Globus, a user selects a file system location and creates a shared endpoint: a virtual endpoint rooted at the shared location on his or her file system. The user can then select other users, or groups of users, who can access the shared endpoint (or parts thereof) by specifying fine-grained read and write permissions. One advantage of this model is that permission changes are reflected immediately, so users can revoke access to a shared dataset instantly.
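The instant-revocation property follows from consulting a live ACL on every access. The sketch below is illustrative only (class and method names are ours, not Globus's): a shared endpoint maps (user, path-prefix) pairs to permission sets, and because each check reads the current ACL, a revocation takes effect on the very next request.

```python
class SharedEndpoint:
    """Toy model of a shared endpoint with fine-grained, live-checked ACLs."""

    def __init__(self, root: str):
        self.root = root
        self.acl = {}  # (user, path_prefix) -> set of permissions

    def grant(self, user: str, path_prefix: str, perms):
        self.acl[(user, path_prefix)] = set(perms)

    def revoke(self, user: str, path_prefix: str):
        # Removing the entry is all that's needed; no caches to invalidate.
        self.acl.pop((user, path_prefix), None)

    def can(self, user: str, path: str, perm: str) -> bool:
        # Every access consults the current ACL, so changes apply immediately.
        return any(
            perm in perms and path.startswith(prefix)
            for (u, prefix), perms in self.acl.items()
            if u == user
        )

ep = SharedEndpoint("/data/project")
ep.grant("alice", "/data/project/results", {"read", "write"})
assert ep.can("alice", "/data/project/results/run1.csv", "read")
ep.revoke("alice", "/data/project/results")   # effective on the next check
assert not ep.can("alice", "/data/project/results/run1.csv", "read")
```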

Globus’s sharing capabilities are extensions built onto the GridFTP server, which, when enabled, let the GridFTP server delegate authorization decisions to Globus. Specifically, two new GridFTP sitecommands let Globus check that sharing is enabled on an endpoint and create a new shared endpoint. We also extended the GridFTP access protocol to al-low access by a predefined trusted Globus user. The access request includes additional parameters such as the shared owner, shared user, and access con-trol list (ACL) for the shared endpoint, which Glo-bus maintains. When accessing the endpoint, this information is passed to the GridFTP server to en-able delegated authorization decisions from the re-questing user to the local user account of the shared endpoint owner. Using this approach, the GridFTP server can perform an authorization check to en-sure that the shared user can access the requested path before following the normal access protocol, which requires changing to the shared endpoint own-er’s local user account and performing the requested action.

Secure Data Access, Transfer, and Sharing

There is a wide range of potential security implications when accessing distributed data hosted by different providers, across security domains, and using different security protocols. Globus's multilayered architecture leverages standard security protocols to manage authentication and authorization, and avoids unnecessary storage of (or access to) users' credentials and data. Most importantly, data does not pass through Globus; rather, Globus acts as a mediator, allowing endpoints to establish secure connections with one another.

Authentication and Authorization

At the heart of the Globus security model is Globus Nexus, which facilitates the complex security protocols required to access the Globus service and endpoints using Globus identities as well as linked external identities.

Globus stores identities (and groups) in a connected graph. For Globus identities, it stores hashed and salted passwords for comparison when authenticating. For the linked identities (SSH public keys, X.509 certificates, OpenID identities, InCommon/CILogon OAuth, and so on) used to provide single sign-on, it stores only public information, such as SSH public keys, X.509 certificates, OpenID identity URLs and usernames, and OAuth provider servers, certificates, and usernames. Thus, when authenticating, Globus can validate a user's identity by following the private authentication process using cryptographic techniques rather than comparing passwords. Consider, for example, authenticating using a campus identity. Here, Globus leverages the InCommon/CILogon system and the OAuth protocol to let users enter their username and password via a trusted campus website. Globus passes a request token with the user authentication and receives an OAuth token and signature in return, which it exchanges for an OAuth access token (and later a certificate) from the campus identity provider.

Linked identities, such as XSEDE identities, are also used for single-sign-on access to endpoints.

Rather than require users to authenticate multiple times for every action, and to allow Globus to manage transfers on a user's behalf, Globus stores short-term proxy credentials. This allows Globus to perform important transfer-management tasks such as restarting transfers upon error. Here, Globus stores an active proxy credential that can be used to impersonate the user, albeit for a short period of time. To do so securely, Globus caches only the active credential and encrypts it using a private key owned by Globus Nexus. When the active credential is required (for example, to compute a file checksum on an endpoint), the credential is decrypted and passed to the specific GridFTP server over the encrypted control channel.

Endpoint Data Access and Transfer

GridFTP uses the Grid Security Infrastructure (GSI), a specification that allows secure and delegated communication between services in distributed computing environments. GridFTP relies on external services to authenticate users and provide trusted signed certificates (typically from a MyProxy server) used to access the server. These certificates are often hidden from users by the use of an online certificate authority (CA), such as MyProxy. The GridFTP service has a certificate containing the hostname and host information that it uses to identify itself. (This certificate is created automatically when users install Globus Connect, or it can be issued by a CA.) In Globus Connect, a MyProxy server can optionally be installed to issue short-term certificates on demand. Globus Connect can also be configured to use external MyProxy servers. Globus, GridFTP, and MyProxy servers are configured to trust the certificates exchanged among them.

MyProxy servers let users obtain short-term credentials that a GridFTP server uses to assert user access to the file system. Administrators can configure MyProxy servers to use various mechanisms for authentication through pluggable authentication modules (PAMs). Usually, these PAMs support local system credentials or institutional LDAP credentials. There are two basic models in which Globus uses a MyProxy server to obtain a credential. In the first, Globus passes the user's username and password to the MyProxy server and receives a credential in response. Thus, users must trust Globus not to store their passwords and to transfer them securely. In the second and preferred model, Globus uses the OAuth protocol to redirect the user to the MyProxy server to authenticate directly (that is, Globus doesn't see the username and password), and the server returns a credential in the OAuth redirection workflow.

When accessing data on an endpoint, Globus uses SSL/TLS to authenticate with the registered GridFTP server using the user's certificate. The GridFTP server validates the user's certificate, retrieves a mapping to a local user account from a predefined mechanism (such as a GridMap file), and changes the local user account (used to access the file system) to the requesting user's local account. Subsequent file system access occurs as the authenticated user's local account. To provide an additional layer of security, endpoint administrators can configure path restrictions (restrict_paths) that limit GridFTP access to particular parts of the file system. For instance, administrators might allow access only to users' home directories or to specialized locations on the file system.
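The restrict_paths idea can be sketched as a prefix check over a normalized path. The option name comes from the text; the matching logic below is an illustrative assumption, not the server's actual implementation. Normalization matters: without it, a relative component like `..` could escape the allowed prefixes.

```python
import posixpath

# Hypothetical allowed prefixes, e.g. home directories and a shared area.
RESTRICT_PATHS = ["/home", "/project/shared"]

def path_allowed(requested: str, allowed_prefixes=RESTRICT_PATHS) -> bool:
    """Return True only if the normalized path falls under an allowed prefix."""
    # normpath collapses "." and ".." so traversal tricks are caught.
    norm = posixpath.normpath(requested)
    return any(
        norm == p or norm.startswith(p.rstrip("/") + "/")
        for p in allowed_prefixes
    )

assert path_allowed("/home/kyle/data.nc")
assert not path_allowed("/etc/passwd")
assert not path_allowed("/home/../etc/passwd")  # traversal attempt is denied
```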

The flow of data between endpoints (including S3-backed endpoints and shared endpoints) is another potential area of vulnerability because data can travel over the general Internet. To provide secure data transfer, Globus supports data encryption based on Secure Sockets Layer (SSL) connections between endpoints. In the case of S3 endpoints, the connection uses HTTPS. To avoid unnecessary overhead for less sensitive data, encryption is not a default setting and must be explicitly selected for individual transfers. The control channel used to start and manage transfers is always encrypted to avoid potential exposure of credential, transfer, and file system information.

Secure Sharing

Globus sharing creates several new security considerations, such as requiring secure peering of shared endpoints and Globus, authorizing access to shared data, and ensuring that file system information is not disclosed outside of the restricted shared endpoint.

The Globus sharing model requires the GridFTP server to be explicitly configured to allow sharing. As part of this process, the GridFTP server is configured to allow a trusted Globus user to access the server (and to later change the local user account to the shared endpoint owner's local user account). A unique distinguished name (DN) obtained from a Globus CA operated for this purpose identifies the user. The GridFTP server is configured to trust both this special Globus user and the Globus CA via the registered DN. During configuration, administrators can set restrictions (sharing_rp) defining which files and paths may be shared on the file system and which users may create shared endpoints. For example, administrators could limit sharing to a particular path (analogous to a public_html directory) and a subset of administrative users.

As part of shared endpoint creation, a unique token is created on the GridFTP server for each shared endpoint. This token safeguards against redirection and man-in-the-middle attacks. For instance, an attacker who gains control of a compromised Globus account might change the physical GridFTP server associated with a trusted endpoint (for example, an XSEDE endpoint) to a malicious endpoint under the attacker's control. In this case, the attacker can create a shared endpoint and can then change the physical server back to the trusted server. Because the unique token is created on the malicious server, it won't be present on the trusted (XSEDE) server, so the attacker won't be able to exploit the shared endpoint to access the trusted server.

Accessing data on a shared endpoint using the extended GridFTP protocol lets Globus access the GridFTP server (as the trusted Globus account). The extended access request specifies the data location, the shared endpoint owner, the user accessing the shared endpoint, and the current ACLs for that shared endpoint. To ensure that data is accessed only within the boundaries of what has been shared and within restrictions placed by the server administrator, the GridFTP server checks restricted paths, shared paths, and Globus ACLs (in that order). Assuming nothing negates the access, the GridFTP server changes the local user account with which it accesses the file system to the shared endpoint owner's local user account and satisfies the request.
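The ordered checks described above can be sketched as a short decision function. This is an illustrative reduction under simplifying assumptions (plain prefix matching, a read-only ACL), not the GridFTP server's code; all names are ours.

```python
def authorize(path, restricted_prefixes, shared_root, acl, user):
    """Deny unless the request passes all three checks, in order:
    1) server-administrator path restrictions,
    2) the shared endpoint's root (shared paths),
    3) the Globus-maintained ACL for this shared endpoint."""
    if not any(path.startswith(p) for p in restricted_prefixes):
        return False  # outside restrict_paths / sharing_rp
    if not path.startswith(shared_root):
        return False  # outside what the owner actually shared
    return user in acl.get("read", set())  # finally, the per-user ACL

ok = authorize("/home/kyle/shared/f.txt",
               restricted_prefixes=["/home"],
               shared_root="/home/kyle/shared",
               acl={"read": {"alice"}},
               user="alice")
assert ok
```

The ordering matters: the administrator's restrictions bound everything, the shared root bounds what the owner exposed, and only then does the per-user ACL apply.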

Finally, because potentially sensitive path information could be included in a shared file path, Globus hides the root path from users accessing the shared endpoint. For example, if a user shares the directory "/kyle/secret/," it will appear simply as "/~/" through the shared endpoint. Globus translates paths before sending requests to the GridFTP server.
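The translation between real and virtual paths can be sketched as a pair of functions mapping the shared root to the "/~/" prefix and back. This is a minimal sketch using the "/kyle/secret" example from the text; the function names and exact mapping rules are our assumptions.

```python
import posixpath

def to_virtual(real_path: str, root: str) -> str:
    """Map a real path under the shared root to its hidden /~/ form."""
    rel = posixpath.relpath(real_path, root)
    return "/~/" if rel == "." else posixpath.join("/~/", rel)

def to_real(virtual_path: str, root: str) -> str:
    """Map a /~/ path back to the real path before contacting GridFTP."""
    rel = virtual_path[len("/~/"):].lstrip("/")
    return root if not rel else posixpath.join(root, rel)

assert to_virtual("/kyle/secret/", "/kyle/secret") == "/~/"
assert to_virtual("/kyle/secret/data.csv", "/kyle/secret") == "/~/data.csv"
assert to_real("/~/data.csv", "/kyle/secret") == "/kyle/secret/data.csv"
```

Because users only ever see the virtual form, the owner's directory names never leave the service that performs the translation.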

Hosting and Security Policies

All Globus services are hosted on AWS. Although this environment has many advantages, such as high availability and elastic scalability, as with all hosting options, it also has inherent risks. We mitigate these risks by following best practices for deploying and managing instances. These practices include storing all sensitive state encrypted; isolating data stores from the general Internet so they're accessible only to Globus service nodes (via AWS security groups); performing active intrusion detection and log monitoring to discover threats; auditing public-facing services and using strict firewalls to restrict access to predefined ports; and establishing backup processes to ensure that all data is encrypted before it's put in cloud storage. To verify that these practices are followed, we conducted an external security review9 and resolved the identified issues.

One important security aspect relates to policies for responding to security breaches and vulnerabilities. The recent Heartbleed bug is an example of a security vulnerability that affected a huge number of websites across the world. Although Globus uses custom data transfer protocols that are unlikely targets of such an attack, exploits via the website, endpoints, and linked identity providers are still possible. In this particular case, we followed predefined internal security policies to determine whether the vulnerability affected our services, patched the issue for all Globus services and Globus-managed endpoints, and generated new private keys. We then followed internal processes for responding to potentially compromised user access by revoking user access tokens (invalidating all user sessions) and analyzing access logs. Finally, because of the exploit's nature, we analyzed all user endpoints to identify potentially vulnerable endpoints. We then contacted administrators of these endpoints and recommended that they take specific measures to patch their systems.

As data sizes increase, researchers must look toward more efficient ways of storing, organizing, accessing, sharing, and analyzing data. Although Globus's capabilities make it easy to access, transfer, and share large amounts of data across an ever-increasing ecosystem of active data endpoints, it also provides a framework on which new approaches for efficiently managing and interacting with big data can be explored.

The predominant use of file-based data is often inefficient because the data required for analysis doesn't always match the model used to store it. Researchers typically slice climate data in different ways depending on the analysis: for example, geographically, temporally, or by a specific type of data such as rainfall or temperature. Accessing entire datasets when only small subsets are of interest is both impractical and inefficient. Although some data protocols, such as the Open source Project for a Network Data Access Protocol (OPeNDAP), provide methods for accessing data subsets within files, no standard model for accessing a wide range of data formats currently exists. Recently, researchers have proposed more sophisticated data access models within GridFTP that use dynamic query and subsetting operations to retrieve (or transfer) data subsets.10 Although this work presents a potential model for providing such capabilities, further work is needed to generalize the approach across data types and to develop a flexible and usable language to express such restrictions.

Files typically contain valuable metadata that can be used for organization, browsing, and discovery. However, accessing this metadata is often difficult because it's stored in various science-specific formats, often encoded in proprietary binary formats, and typically unstructured (or at least doesn't follow standard conventions). Moreover, even when the metadata is accessible, few high-level methods exist for browsing it across many files or across storage systems. Often, the line between metadata and data is blurred, and, whereas metadata might be unnecessary for some analyses, it can be valuable for others. Thus, we need methods that enable structured access to both data and metadata using common formats. Given that metadata can describe data or contain other sensitive information (for example, patient names), it's equally important to provide secure access methods. We therefore need models that expose such metadata to users and let them query over it to find relevant data for analysis or share it in a scalable and secure manner.

Often, data sharing occurs for the purpose of publishing to the wider community or as part of a publication. Considerable research has explored current data publishing practices.11,12 In many cases, researchers found that data wasn't published with papers and that original datasets couldn't be located. This affects one of the core principles of scientific discovery: that research is reproducible and verifiable. In response, funding agencies and publishers are increasingly placing strict requirements on data availability associated with grants and publications, although these requirements are often disregarded.12 Even when researchers do publish data, they often do so poorly, in an ad hoc manner that makes the data difficult to find and understand (due to a lack of metadata), and with little guarantee that the data is unchanged or complete. We need new systems that let researchers publish data, easily associate persistent identifiers (such as DOIs) with that data, provide guarantees that the data is immutable and consistent with what was published, provide common interfaces for discovering and accessing published data, and do so at scales that correspond to the growth of big data.

Although these three areas represent different research endeavors, they all require a framework that supports efficient and secure data access. Globus provides a model on which we can continue to innovate in these areas to provide enhanced capabilities directly through the existing network of Globus endpoints. We benefit from using Globus's transfer and sharing capabilities and from leveraging the same structured approaches to authentication and authorization.

We intend to continue developing support for other cloud storage systems and providers, such as persistent long-term storage like Amazon Glacier and storage models used by other cloud providers (Microsoft Azure Storage, for example), with the goal of developing an increasingly broad data cloud.

Acknowledgments

We thank the Globus team for implementing and operating the Globus services. This work was supported in part by the US National Institutes of Health through NIGMS grant U24 GM104203, the Biomedical Informatics Research Network Coordinating Center (BIRN-CC); the US Department of Energy through grant DE-AC02-06CH11357; and the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by US National Science Foundation grant ACI-1053575.

References

1. D. Lifka et al., XSEDE Cloud Survey Report, tech. report 20130919-XSEDE-Reports-CloudSurvey-v1.0, XSEDE, 2013.

2. T. Hey, S. Tansley, and K. Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.

3. R.L. Grossman et al., "The Design of a Community Science Cloud: The Open Science Data Cloud Perspective," Proc. 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC 12), 2012, pp. 1051–1057.

4. I. Foster, "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services," IEEE Internet Computing, vol. 15, no. 3, 2011, pp. 70–73.

5. R. Ananthakrishnan et al., "Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications," Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER), 2013, pp. 1–3.

6. B. Allen et al., “Software as a Service for Data Scientists,” Comm. ACM, vol. 55, no. 2, 2012, pp. 81–88.

7. W. Allcock et al., “The Globus Striped GridFTP Framework and Server,” Proc. 2005 ACM/IEEE Conf. Supercomputing (SC 05), pp. 54–64.

8. J. Novotny, S. Tuecke, and V. Welch, "An Online Credential Repository for the Grid: MyProxy," Proc. 10th IEEE Int'l Symp. High Performance Distributed Computing, 2001, pp. 104–111.

9. V. Welch, Globus Online Security Review, tech. report, Indiana Univ., 2012; https://scholarworks.iu.edu/dspace/handle/2022/14147.

10. Y. Su et al., "SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 13), 2013, article 47.

11. T.H. Vines et al., "The Availability of Research Data Declines Rapidly with Article Age," Current Biology, vol. 24, no. 1, 2014, pp. 94–97.

12. A.A. Alsheikh-Ali et al., "Public Availability of Published Research Data in High-Impact Journals," PLoS ONE, vol. 6, no. 9, 2011, e24357.

KYLE CHARD is a senior researcher at the Computation Institute, a joint venture between the University of Chicago and Argonne National Laboratory. His research interests include distributed meta-scheduling, grid and cloud computing, economic resource allocation, social computing, and services computing. Chard received a PhD in computer science from Victoria University of Wellington, New Zealand. Contact him at [email protected].

STEVEN TUECKE is deputy director at the University of Chicago's Computation Institute, where he's responsible for leading and contributing to projects in computational science, high-performance and distributed computing, and biomedical informatics. Tuecke received a BA in mathematics and computer science from St Olaf College. Contact him at [email protected].

IAN FOSTER is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne senior scientist and distinguished fellow, and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research interests include distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom. Contact him at [email protected].

Watts S. Humphrey Software Process Achievement Award

Nomination Deadline: January 15, 2015

Do you know a person or team that deserves recognition for their process improvement activities?

The IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement Award is presented to recognize outstanding achievements in improving the ability of a target organization to create and evolve software.

The award may be presented to an individual or a group, and the achievements can be the result of any type of process improvement activity.

To nominate an individual or group for a Humphrey SPA Award, please visit http://www.computer.org/portal/web/awards/spa

IEEE Computer Society | Software Engineering Institute

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


Location-Based Security Framework for Cloud Perimeters

Chetan Jaiswal, Mahesh Nath, and Vijay Kumar, University of Missouri, Kansas City

Many enterprise and government organizations use a variety of mobile gadgets to connect to the cloud to manage their data processing requirements. Although this platform improves availability and performance, it also increases security risk, as it can allow unwanted malicious network traffic into the organization. Firewall filtering is often inadequate for stopping these attacks. The problem becomes more complex when multiple firewalls are deployed, because coordination among them becomes extremely difficult if not impossible.

Current firewalls use static filtering policies. Although simple, a static policy has many disadvantages. First, because border routers enforce a static policy, they can't react to changes in the external environment. Second, because of physical limitations and differences in trust relationships between an enterprise and its immediate neighbors, some firewalls might require preferential treatment over others in admitting different kinds of traffic streams. Therefore, perimeter protection that reacts to dynamic changes and respects organizational objectives (such as preferential treatment) while enforcing the organization's overall security objectives requires dynamic and flexible policies at each border gateway. These local policies must also be part of a global policy so that they enforce common security objectives in mobile clouds.

A new approach to composing firewall policies to protect mobile and static cloud perimeters uses location to filter out attacks from unsafe locations.

SECURE BIG DATA IN THE CLOUD

56 IEEE CLOUD COMPUTING | PUBLISHED BY THE IEEE COMPUTER SOCIETY | 2325-6095/14/$31.00 © 2014 IEEE


The protection issue becomes more complex when we consider attacks from mobile sources. Unlike threats from stationary attackers, mobile attackers disappear from the attack location and resurface elsewhere. We introduce location-attack protection, in which the firewall can completely block messages from high-cybercrime locations (country, state, and so on). To enable this protection, we use constrained logic programming by appropriately altering the flexible authorization framework (FAF)1 and its extensions, and strand spaces and multiset rewriting strategies for protocol analysis.2–4 These two formalisms control multiple streams of data exchanged between two participants, which is relevant to our framework because it requires fine-grained and protocol-specific perimeter protection policies. They can also easily incorporate new spatial and temporal parameters that are unique and crucial to mobile clouds. Our scheme supports consistency and completeness of local policies; these properties are important because an individual gateway or firewall on the perimeter needs to know whether to allow or deny a stream's progress. The scheme ensures that the composition of local policies is logically correct, to obtain an enterprise-wide perimeter protection policy. In addition, our scheme ensures that the propagation of policy changes to other policies is correct. If the global policy changes, then all local policies must accommodate that change. Conversely, if a local policy changes, the related global policy may change, which in turn may trigger changes in other dependent local policies.

A mobile unit frequently changes location, which can introduce inconsistency at the policy level. For example, a mobile user might be subject to a different set of constraints in Kansas City, Kansas, than in Kansas City, Missouri, which will affect the data access pattern. Firewall filtering schemes must handle such policy changes, due to mobility and other necessary revisions, in real time to eliminate false denials. This becomes tricky because a mobile unit becomes unreachable when it's switched off or slips into doze mode:5 updates or changes can only be installed when the unit becomes active. To address this problem, we use a twofold optimization strategy. In the first phase, we apply fold/unfold6 transformations to optimize policy rules. In the second phase, we partially materialize static parts of individual policies, excluding dynamic variables that share information between local and global policies. We then translate such optimized policies into rule sets used by today's firewalls.7

Mobile Cloud

Mobile clouds support personal and terminal mobility.5 A mobile unit can mount attacks from any location at any time. Attack packets pass through several gateways before reaching the cloud. Each gateway has its own dynamic firewall, and the cloud is protected by its own firewall. Whenever a firewall policy change is incorporated on any of the firewalls, the change is propagated to all other firewalls for updating.

Firewalls are typically configured using a rule base specifying which inbound or outbound packets (or sessions) are to be allowed or blocked. A Cisco rule7 looks as follows: pass tcp 20.9.17.8 0.0.0.0 121.11.127.20 0.0.0.0 range 23 27. This rule says that TCP packets from IP address 20.9.17.8 to IP address 121.11.127.20 are to be accepted if the destination port is in the range 23 to 27. The 0.0.0.0 segments mean that address masking isn't used. Generally, such rules are listed in some order in access lists.7

When a firewall receives a packet, it goes through the list, matches the first rule that applies to the packet, and follows the specified action. Firewalls use a closed policy that drops packets not explicitly permitted by any rule. This procedure leads to several problems:

• Because the rules are written at the lower protocol level, a misconfiguration can make the whole intranet unreachable.
• The rule base might have many redundant rules.
• The semantics depend not only on the rules, but also on the order in which they're listed, an undesirable feature.
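First-match semantics and the order dependence described above can be sketched with the Cisco-style rule from the text. The packet representation and the second (deliberately redundant) rule are hypothetical:

```python
# Sketch of first-match firewall semantics with a closed (default-deny)
# policy, using the rule from the text: allow TCP from 20.9.17.8 to
# 121.11.127.20 on destination ports 23-27.

RULES = [
    # (action, proto, src, dst, port_lo, port_hi)
    ("pass", "tcp", "20.9.17.8", "121.11.127.20", 23, 27),
    ("deny", "tcp", "20.9.17.8", "121.11.127.20", 1, 65535),  # redundant here
]

def decide(pkt):
    # First matching rule wins; order therefore changes the semantics.
    for action, proto, src, dst, lo, hi in RULES:
        if (pkt["proto"] == proto and pkt["src"] == src
                and pkt["dst"] == dst and lo <= pkt["dst_port"] <= hi):
            return action
    return "deny"  # closed policy: drop anything no rule permits

pkt = {"proto": "tcp", "src": "20.9.17.8", "dst": "121.11.127.20", "dst_port": 24}
print(decide(pkt))                      # pass
print(decide({**pkt, "dst_port": 80}))  # deny
```

Swapping the two rules flips the decision for port 24, which is exactly the order dependence the list above calls undesirable.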

Earlier research on this issue provided solutions with limited success. For example, Yair Bartal and his colleagues proposed Firmato, a firewall management toolkit.8 Although it models the firewall security policy and network topologies, it doesn't permit fine-grained admission control of streams, doesn't cover intranets with multiple external gateways that enforce different policies, and can't be used to obtain


58 IEEE CLOUD COMPUTING  WWW.COMPUTER.ORG/CLOUDCOMPUTING

SECURE BIG DATA IN THE CLOUD

the global health of the traffic streams entering the Internet. Alain Mayer and his colleagues9 present Fang, a firewall analysis engine that has the same deficiencies as Firmato.8 Our scheme resolves these deficiencies. A cryptography-based scheme relies on decentralized trust management.10 Our solution distributes network perimeter protection without relinquishing centralized control and thereby circumvents the performance bottlenecks of a centralized perimeter protection policy.

Unlike wired systems, a mobile node can issue a request from any location, connect to many service providers that might have different security requirements, slip into doze mode, power off, or fail. Mobile nodes are also vulnerable to attacks. A mobile client's valid request from one location can be denied at another location. Several good schemes have been proposed for protecting mobile systems through firewalls11; however, they provide engineering solutions to firewall protection and appear highly system dependent.

We conclude the following: we need real-time synchronization of firewalls and subsequent updates; multilayer verification is the way to go; and the system must implement geographical location-based verification. Our logic-based framework meets these requirements.

Flexible Perimeter Protection Framework

Because a common framework can protect both mobile and wired traffic, we built a unified flexible framework to monitor and dynamically adjust the entering datastreams. FPPF is based on rules built with predicates to express policies for accepting packets in an ongoing stream. We wrote the FPPF filter, or protection, rules in the Flexible Parameter Protection Specification Language (FPPL). We briefly introduce the language's salient features here; details can be found elsewhere.1

Example Policies Written in FPPL

FPPL, similar to other logical languages, consists of constant symbols, variables, function symbols, and terms. It uses a set of predicates to define packet acceptance and rejection rules for local and enterprise-wide policies.

For example, the rules in Figure 1a define a local policy. Rule 1 says that stream Si can be opened if it has been permitted to do so; permToOpen(Si) holds when this is true. Rule 2 says that the next packet of Si is admitted as long as Si isn't blocked, the local packet acceptance and global approval policies allow it, and the corresponding local and global statistics are updated. Rule 3 defines the condition for Si being blocked: the local variable Li (say, the buffer capacity allocated to this stream) has been used up by the stream.

The local policy needs to know that the predicate gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +) (a part of the enterprise-wide policy) is true for the next packet to be admitted, according to the agreement it has with the enterprise-wide security policy. Conversely, it's obligated to update the global statistics updtGlobalStat(car(post), Si, [P1, . . ., Pn], +), which are a part of the enterprise-wide security policy. Note that the other variables appearing in these two predicates, namely [P1, . . ., Pn], are unknown to the local policy,

Rule 1: procNxtPkt([], post, Si, +) ← permToOpen(Si)
Rule 2: procNxtPkt(pre*car(post), post, Si, +) ← ¬blocked(Si),
        PROVISION(self): updtLocalStat(car(post), Si, Li),
        localPktAcpPolicy(car(post), pre, Si, +),
        PROVISION(global): gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +)
        OBLIGATION(global): updtGlobalStat(car(post), Si, [P1, . . ., Pn], +)
Rule 3: blocked(x) ← Li = maxLi

(a)

Rule 4: procNxtPkt([], post, Si, +, LOi) ← permToOpen(Si, LOi)
Rule 5: procNxtPkt(pre*car(post), post, Si, +) ← ¬blocked(Si, LOi),
        PROVISION(self): updtLocalStat(car(post), Si, Li),
        localPktAcpPolicy(car(post), pre, Si, +, LOi),
        PROVISION(global): gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +, LOi)
        OBLIGATION(global): updtGlobalStat(car(post), Si, [P1, . . ., Pn], +, LOi)
Rule 6: blocked(x) ← Li = maxLi

(b)

FIGURE 1. Packet acceptance and rejection rules written in Flexible Parameter Protection Specification Language (FPPL) for (a) local and enterprise-wide policies and (b) mobile policies.
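The control flow of rule 2, with the provision as an opaque predicate imported from the enterprise policy and the obligation fulfilled by a call into the global policy base, can be sketched as follows. The class, the predicate bodies, and the capacity numbers are invented for illustration; only the predicate names mirror the FPPL rules.

```python
# Hedged sketch of rule 2's control flow: admit a packet when the stream
# isn't blocked, the local acceptance policy holds, and the global
# provision (whose definition the local side never sees) evaluates true;
# the local side then fulfills its obligation by updating global stats.

class LocalPolicy:
    def __init__(self, max_l, g_pkt_ac_policy, updt_global_stat):
        self.used = 0
        self.max_l = max_l                        # Li's capacity (rule 3)
        self.g_pkt_ac_policy = g_pkt_ac_policy    # provision: opaque callable
        self.updt_global_stat = updt_global_stat  # obligation to fulfill

    def blocked(self):
        return self.used >= self.max_l

    def local_pkt_acp_policy(self, pkt):
        return pkt.get("well_formed", True)       # invented local check

    def proc_nxt_pkt(self, pkt):
        if self.blocked():
            return False
        if not self.local_pkt_acp_policy(pkt):
            return False
        if not self.g_pkt_ac_policy(pkt):         # PROVISION(global)
            return False
        self.used += 1                            # updtLocalStat
        self.updt_global_stat(pkt)                # OBLIGATION(global)
        return True

global_log = []
policy = LocalPolicy(max_l=2,
                     g_pkt_ac_policy=lambda p: True,
                     updt_global_stat=global_log.append)
print([policy.proc_nxt_pkt({}) for _ in range(3)])  # [True, True, False]
```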


SEPTEMBER 2014  IEEE CLOUD COMPUTING 59

so they can’t be used in the rule as a normal predi-cate. The enterprise-wide policies are composed accordingly.

Enhancing Provisions, Obligations, and Delegations

Provisions and obligations play a key role in the FPPF architecture. As the previous example demonstrates, local policies depend on having provisions approved by the global policy base, and, in turn, local policies are obliged to update their local statistics with the global policy base. This two-way exchange of data allows the global policy to respond to perimeter-wide changes in an accurate and timely manner.

In our case, the provision granted by the global policy base to the local policy is specified in the predicate gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +). Therefore, we model a provision as a predicate exported by the grantor and imported by the grantee. The main characteristic of the provision is that the grantee doesn't know its definition, but does know whether it evaluates to true or false.

The obligation used in the example is the predicate updtGlobalStat(car(post), Si, [P1, . . ., Pn], +). Note that this is also a predicate that's exported to the local policy base, which must instantiate the proper instances of variables that would make the predicate instance true. This obligation can be fulfilled when the function call is made.

Policy Updates in Mobile Systems

Geographical location plays an important role in managing mobile activities (such as an attack). We include geographical locations in the predicates used to specify local and enterprise-wide policies. The firewall policy in the cloud will depend on where Si originates (location-specific attacks). Thus, the predicates for the local policy will include the attacker's location, and the predicates for the enterprise-wide policy will depend on the set of locations where a mobile unit is permitted to roam.

We illustrate mobile policy composition with a modified rule and example (Figure 1b).

In this example, rule 4 says that Si can be opened if it originated at location LOi (longitude), and if it has been permitted to do so; permToOpen(Si, LOi) holds when this is true. Rule 5 says that the next packet of Si is admitted provided that Si isn't blocked, the local packet acceptance and global approval policies allow it, and the corresponding local and global statistics are updated. Rule 6 defines the condition for Si being blocked: the local variable Li (say, the buffer capacity allocated to this stream) has been used up by the stream.

In mobile attacks, the attack location can change frequently. As a result, firewall policies can change frequently, and the resulting update traffic might mean the system can't handle such frequent updates or keep the local and global policies in sync. Our scheme addresses this issue by keeping the policy warehouse at all base stations. Because a base station serves a specific location, the policy relevant to that location is loaded there. A mobile unit caches the policy from the base station of the cell it's visiting. A base station broadcasts policy changes as they occur; all mobile units in that cell capture them, and visitors to the cell acquire them when they register.
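The base-station policy warehouse and broadcast scheme might look like the following sketch. All names and the policy format are assumptions; the point is only that active units receive pushed changes while visitors re-sync on registration.

```python
# Illustrative sketch of location-scoped policy distribution: each base
# station holds only the policy slice for its cell and broadcasts
# changes; mobile units cache the current version when active and
# re-sync when they register in a new cell.

class BaseStation:
    def __init__(self, cell):
        self.cell = cell
        self.policy = {}       # policy warehouse slice for this cell
        self.version = 0
        self.listeners = []    # active mobile units in the cell

    def update_policy(self, rule_id, rule):
        self.policy[rule_id] = rule
        self.version += 1
        for mu in self.listeners:  # broadcast as changes occur
            mu.receive(self.policy, self.version)

class MobileUnit:
    def __init__(self):
        self.cache, self.version = {}, -1

    def receive(self, policy, version):
        self.cache, self.version = dict(policy), version

    def register(self, bs):
        # A visitor (or a unit waking from doze mode) acquires the
        # current policy on registration rather than waiting for a push.
        bs.listeners.append(self)
        self.receive(bs.policy, bs.version)

bs = BaseStation("KC-MO-3450")
bs.update_policy("r1", "deny LAC 3450 port 23")
mu = MobileUnit()
mu.register(bs)
print(mu.version, mu.cache)  # 1 {'r1': 'deny LAC 3450 port 23'}
```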

Unsafe Locations

In our experience, more attacks (serious or less serious) come from some locations than others. The predicates coded in our system's firewalls include a location parameter to identify an attack's origin. If an attack originates from an "unsafe" location, it's blocked. We define three location categories.

A hard location is one from which numerous serious attacks originate with high frequency. Any of these attacks can severely affect the cloud's performance and integrity, so the firewall must stop them. If the firewall detects that an attack (for example, a Trojan) is mounted from a hard location, it immediately eliminates the attack.

A soft location is one from which relatively few serious attacks originate. These attacks don't significantly affect the cloud's performance and integrity, and the cloud system can continue to function while the firewall handles the attack. For example, a music-sharing virus can scare people without harming the computer, so the firewall might let it enter the cloud.

Finally, a clean location is one from which no attacks originate. The firewall might apply minimal security checks to messages coming from these locations.

Unsafe Location Identification in Mobile Communication

When an attacker moves around inside a location while attacking the cloud, it will have the same IP address at different points inside that location. For example, if an attacker moves from point li to point lj inside a location L, the IP address will not change; points li and lj will have the same IP address even though their geographical position inside L changes. Thus, to hide their identity and avoid being caught while moving, attackers


generally use a proxy. The tracert (trace route) command, which shows the path an IP packet travelled to reach a destination, isn't helpful because it can't go beyond the proxy.

In a mobile network, IP address allocation is dynamic, so it's easy for an attacker to spoof an IP address and mount an attack through a proxy. To reach the attacker behind the proxy, our approach identifies a mobile phone's location in a cellular network through its cell global identity (Figure 2).

At present, cell global identity information is not available in IP packets coming from a mobile device. Our scheme (Figure 2) extends the structure of an IP packet to include cell global identity information. This helps us identify the location of the mobile unit mounting the attack, directly or through a proxy, and program the firewall accordingly to block the attack. For example, a cell global identity with mobile country code (MCC) = 310 (USA), mobile network code (MNC) = 410 (AT&T network), location area code (LAC) = 3450, and cell identity (CI) = 118541125 represents a cell in Kansas City, Missouri.

As a security measure, the system maintains the location area codes of unsafe locations and discards incoming packets from these locations without even analyzing them. To determine whether a location is safe or unsafe, the system records the number of attacks from each known location. If the number of attacks from a particular location reaches a threshold, it marks the location as unsafe. Figure 3 illustrates our approach for determining unsafe locations; it's an ongoing process based on the number of attacks originating from a specific location. We maintain a database of unsafe locations.

The serving GPRS support node (SGSN) is the main component of the General Packet Radio Service (GPRS) network. The SGSN can make this location information available in IP packets coming from mobile stations because it has access to the location information (CGI) of each mobile station in its area and is also responsible for delivering data packets to and from those stations. The IP packet header contains an optional field that can be used to store the cell global identity.

Figure 4 shows the flow of IP packets in the Global System for Mobile Communication (GSM) and Universal Mobile Telecommunications System (UMTS) architectures. We've included the relevant elements of GSM in our scheme for identifying unsafe locations. On the receiving end, the firewall extracts the CGI information from the enhanced IP packet, searches the hard unsafe location (HUL) and mild unsafe location (MUL) lists, and decides whether to reject or allow the packet.

Because our approach can identify the attacker's location, it compromises the attacker's privacy. Although this is not an issue in the case of an attacker, our scheme should be able to protect the privacy (a breach of which could lead to a security breach) of a typical user whose actions merely look like an attack. We're investigating a solution to ensure that the location isn't accessible to anybody, or that only its encrypted version is accessible.

FIGURE 2. Cell global identification (CGI). A CGI is the concatenation MCC | MNC | LAC | CI, where MCC is the 3-digit mobile country code, MNC the 2- or 3-digit mobile network code (for GSM/UMTS applications), LAC the location area code, and CI the cell identity; MCC, MNC, and LAC together form the location area identification.
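A minimal sketch of carrying the CGI in packet bytes follows. The field widths are illustrative assumptions (in particular, the 4-byte CI is chosen only so the text's example value fits), not the encoding the authors use:

```python
# Sketch: packing cell global identity (CGI = MCC | MNC | LAC | CI) into
# bytes that could travel in an IP header option, and parsing it back at
# the receiving firewall.

import struct

def pack_cgi(mcc, mnc, lac, ci):
    # 2 bytes MCC, 2 bytes MNC, 2 bytes LAC, 4 bytes CI (big-endian).
    return struct.pack("!HHHI", mcc, mnc, lac, ci)

def unpack_cgi(blob):
    mcc, mnc, lac, ci = struct.unpack("!HHHI", blob)
    return {"mcc": mcc, "mnc": mnc, "lac": lac, "ci": ci}

# The example from the text: MCC 310 (USA), MNC 410 (AT&T), LAC 3450.
blob = pack_cgi(310, 410, 3450, 118541125)
print(unpack_cgi(blob)["lac"])  # 3450
```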

FIGURE 3. Determining unsafe locations. (HUL: hard unsafe location list; MUL: mild unsafe location list; T: attack threshold for a particular location.) An incoming packet whose LAC is in the HUL is discarded immediately; otherwise the packet is checked, and if it's malicious, the LAC's MUL counter is incremented and the packet is discarded. When a LAC's MUL counter exceeds T, the LAC is removed from the MUL and added to the HUL.
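The flow in Figure 3 can be sketched as follows, with the threshold logic inferred from the chart labels; the data structures and threshold value are illustrative:

```python
# Sketch of the unsafe-location flow: packets from a LAC on the hard
# list are discarded outright; malicious packets bump the LAC's
# mild-list counter, and past threshold T the LAC is promoted from the
# MUL to the HUL.

T = 3                 # attack threshold per location (illustrative)
HUL, MUL = set(), {}  # hard unsafe locations; mild-list counters per LAC

def handle(lac, malicious):
    if lac in HUL:
        return "discard"
    if not malicious:
        return "allow"
    MUL[lac] = MUL.get(lac, 0) + 1   # add LAC to MUL / increment counter
    if MUL[lac] > T:
        del MUL[lac]                  # remove LAC from the MUL...
        HUL.add(lac)                  # ...and add it to the HUL
    return "discard"

for _ in range(5):
    handle(3450, malicious=True)      # repeated attacks from LAC 3450
print(3450 in HUL, handle(3450, malicious=False))  # True discard
```

Once a LAC is on the hard list, even non-malicious packets from it are dropped without analysis, which matches the "discard without even analyzing" behavior described above.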


tion isn’t accessible to anybody or only its encrypted version is accessible.

Unsafe Location Identification in Wired Networks

Because our scheme for identifying unsafe locations on a cellular platform won't work for stationary attackers, we developed a different solution for them. We use known landmarks (that is, trusted computers on the Internet whose geographic coordinates are known a priori) to probe an attacker's machine (at an unknown location) and measure the response delay to compute the attacker's coordinates. We repeat this process until we identify the location as accurately as possible.

The packet transfer delay is directly proportional to the distance. Line congestion, queuing delays, and so on can affect this relationship, but by consolidating several observations, we can identify the location with reasonable accuracy. We probe the destination from several landmarks to get a delay vector, which we then use to compute an overlap area containing the destination. With many landmarks, we can triangulate the results to determine the attacker's geoposition. In our approach, we first create a large dataset of real-world measurements by measuring the latency from each landmark to every other landmark. We used PlanetLab (www.planet-lab.org) and a carefully selected, geographically diverse set of landmarks across the globe. We used triangulation to implement our algorithm, which uses three out of 50 landmarks as the starting point and iterates with three different landmarks until we obtain consistent results. Our observation concurs with other work.12
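The distance-to-delay calibration can be sketched like this. The numbers are made up, and the real scheme builds per-landmark and per-zone datasets rather than the single global ratio used here:

```python
# Sketch of delay-based distance estimation: calibrate a distance-to-
# delay ratio (DDR) from landmark-to-landmark probes, then turn the
# lowest observed delay to the target into an estimated distance.

def calibrate_ddr(pairs):
    # pairs: [(known_distance_km, avg_lowest_delay_ms), ...]
    return sum(d / t for d, t in pairs) / len(pairs)  # km per ms

def estimate_distance(ddr, delays_ms):
    # Congestion and queuing only inflate delay, so take the minimum of
    # repeated probes before applying the ratio.
    return ddr * min(delays_ms)

ddr = calibrate_ddr([(1000, 12.5), (2400, 30.0), (600, 7.5)])
print(round(ddr, 1))                               # 80.0
print(estimate_distance(ddr, [26.0, 25.1, 31.4]))  # 2008.0
```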

Our algorithm (see the sidebar) starts with a set of continental landmarks (CLMs) measuring the delay in reaching the attacker machine (AMX). We considered the landmark at the University of California, Berkeley, as the west CLM (CLM2) and the landmark at Michigan Technological University as the north CLM (CLM3). The algorithm first randomly selects three landmarks, each from a different CLM dataset (CLM1, CLM2, CLM3, or CLM4). In the next step, each LM-CLMi (where 1 ≤ i ≤ 3) individually measures the delay to AMX. Using AvgLDelayi (distance-to-delay measurements based on the average lowest delay between any two landmarks), each LM-CLMi estimates the distance to AMX.

We create AvgLDelay for each LM-CLM, which provides the average ratio of distance to delay (DDR). After estimating the distance of AMX from the three LM-CLMs, our algorithm ascertains the geolocation (AMLA, AMLO) of AMX. In step 5, the algorithm considers the area (Zonal_Region) surrounding (AMLA, AMLO), called the initial zone. After identifying the initial zone, it creates AvgLowestNodeToZoneDelay (a dataset of distance-to-delay measurements based on the average lowest delay between a particular node and the zone with AMX at the given latitude and longitude) for each selected LM-CLMi on the fly. In the next step, each LM-CLMi individually measures the delay to AMX. Using AvgLowestNodeToZoneDelayi, each LM-CLMi estimates the distance to AMX. After estimating the distance of AMX from the three LM-CLMs, the algorithm ascertains the new geolocation (AMLA1, AMLO1) of AMX. In step 8, it finds the set of zonal landmarks (ZLMs) in the zonal region (initial value ±4°) around (AMLA1, AMLO1), which we call the final zone.

It's important to find landmarks that are diverse with respect to each other as well as to (AMLA1, AMLO1). Once the final zone is identified, the algorithm creates AvgLowestZonalDelay on the fly by considering the prerecorded minimum delays from each LM-ZLMi to LM-ZLMj (∀i, j: i ≠ j, 1 ≤ i ≤ n, 1 ≤ j ≤ n, each LM-ZLM ∈ Final_Zone, where n is the total number of landmarks in the final zone). This dataset provides the final zone's DDR. In the next step, each LM-ZLMi measures the delay to AMX.

FIGURE 4. Partial network architecture: GSM + GPRS + UMTS. (BSC: base station controller; RNC: radio network controller; SGSN: serving GPRS support node; GGSN: gateway GPRS support node.) A mobile station's IP packet reaches the SGSN through the BSC or RNC; the SGSN appends the cell global identity, and the packet travels as IP packet + CGI through the GGSN and the Internet. At the receiving end, the cloud's firewall checks the CGI against the HUL and MUL lists.


ALGORITHM STEPS

Here, we describe the steps in our algorithm. We consider two cases: the general case, in which more than three landmarks are required to ascertain the geolocation; and the best case, in which only three landmarks are required.

Total number of landmarks (LMs) = N

1. Select any three LMs (LM-CLM1, LM-CLM2, LM-CLM3) from the CLM sets (Figure A).
2. Calculate AvgLDelay at each LM-CLMi to attacker machine AMX.
3. Estimate the distance (CLM-Disti) from each LM-CLMi to AMX based on AvgLDelay.
4. Ascertain the location of AMX using trilateration as (AMLA, AMLO). We used the great circle and aviation formulas1 to perform our calculations. We know the lengths of the sides of triangle ABC (Figure A1), which are the distances between the known landmarks. Using these lengths, the angles, and the estimated lengths of AD, BD, and CD, we apply triangulation to get the coordinates of point D (AMLA, AMLO).
5. Find all LMs within ±4° (Zonal_Region) of (AMLA, AMLO); we refer to this as the Initial_Zone. Because we know the geolocations of all landmarks, our algorithm finds the landmarks in the Initial_Zone (Figure A2).
6. Create AvgLowestNodeToZoneDelay.
7. Estimate the distance (CLM-Disti) again from each LM-CLMi to AMX based on AvgLowestNodeToZoneDelay.
8. Ascertain the AMX location using triangulation as (AMLA1, AMLO1). The algorithm calculates the new geolocation of AMX based on the current zonal DDR (Figure A3); D1 is the new geolocation of AMX (AMLA1, AMLO1).
9. Find all LMs in the Zonal_Region of (AMLA1, AMLO1); call this the Final_Zone. If Total_LMs in the Zonal_Region < 3, exit with the result (AMLA1, AMLO1).
10. Select any three LMs (LM-ZLM1, LM-ZLM2, LM-ZLM3) from the Final_Zone based on the top values of Diversej = ABS(LM-ZLMjLA – AMLA1, LM-ZLMjLO – AMLO1) and ABS(LM-ZLMjLA – LM-ZLMiLA, LM-ZLMjLO – LM-ZLMiLO). Figure A4 shows the zonal landmarks (ZLMs) of the Final_Zone. Because we need only three LMs in each iteration, our algorithm computes the diversity parameter for all the LMs in the zone and selects the three LMs with the highest values (Figure A5).
11. Create AvgLowestZonalDelay.
12. Calculate the average-of-lowest delay at each LM-ZLMi to AMX and estimate the distance from each LM-ZLMi to AMX based on AvgLowestZonalDelay.
13. Ascertain the AMX location using triangulation as (AMLA2, AMLO2) (Figure A6). After step 13, there are two locations of AMX: (AMLA1, AMLO1) and (AMLA2, AMLO2).
14. Set Zonal_Region = Zonal_Region – Closing_Factor.
15. Compare (AMLA1, AMLO1) and (AMLA2, AMLO2). If the result is satisfactory or Zonal_Region = 0°, exit with the result (AMLA2, AMLO2). If the result is satisfactory in the first iteration, the algorithm terminates; steps 9 through 15 thus validate the results. This is the best case, because only three LMs identify the AMX location accurately in just one iteration. If the result is unsatisfactory, the algorithm iterates steps 9 through 16, as follows:
16. Go to step 9 with (AMLA1, AMLO1) = (AMLA2, AMLO2).

Reference
1. E. Williams, Aviation Formulary V1.46; http://williams.best.vwh.net/avform.htm.


FIGURE A. Landmark selection and attacker location estimation: (1) calculate the lengths of the sides of triangle ABC and estimate the attacker location D; (2) find the landmarks in the Initial_Zone around (AMLA, AMLO); (3) calculate the new position D1 of the attacker machine AMX at (AMLA1, AMLO1); (4) determine the zonal landmarks (ZLMs) of the Final_Zone; (5) select the three landmarks (LMs) with the highest diversity values; and (6) ascertain the attacker machine location D2 at (AMLA2, AMLO2) using triangulation.


By using AvgLowestZonalDelayi, each LM-ZLMi estimates the distance to AMX. After estimating the distance of AMX from the three LM-ZLMs, the algorithm ascertains the new geolocation (AMLA2, AMLO2) of AMX. It then compares the two geolocations (AMLA2, AMLO2) and (AMLA1, AMLO1) for error distance.

If the error distance is less than 10 miles, we consider the result satisfactory, and the algorithm terminates with (AMLA2, AMLO2) as the final geolocation of AMX. It can also terminate when the zonal region reaches zero or the total number of landmarks in the Zonal_Region is less than 3. In other cases, the algorithm continues to iterate until it reaches the desired location accuracy.
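The overall iteration (steps 9 through 16, condensed) might be sketched as follows. The estimate() callable stands in for the triangulation step, and a crude flat-earth error metric replaces the great-circle formulas purely for brevity; all names are illustrative.

```python
# Condensed sketch of the convergence loop: re-estimate the position
# with zone-local landmarks, shrink the zonal region, and stop when
# successive estimates agree to within 10 miles (or the zone closes).

def error_miles(a, b):
    # Crude flat-earth error in miles (1 degree is roughly 69 miles);
    # the real scheme uses great-circle distance instead.
    return 69.0 * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def geolocate(estimate, zone_deg=4.0, closing_factor=1.0, tol_miles=10.0):
    prev = estimate(zone_deg)                 # (lat, lon) from initial zone
    while zone_deg > 0:
        zone_deg = max(0.0, zone_deg - closing_factor)
        cur = estimate(zone_deg)
        if error_miles(prev, cur) < tol_miles or zone_deg == 0:
            return cur                        # satisfactory, or zone exhausted
        prev = cur

# Toy estimator that converges toward (39.1, -94.6) as the zone shrinks.
print(geolocate(lambda z: (39.1 + 0.05 * z, -94.6 - 0.05 * z)))
```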

Our current work will provide us with a platform for dealing with the security of the global cloud structure (linking all customers who rent cloud services). At present, banks are reluctant to use cloud services because they don't know the whereabouts of datacenters. Our system will identify a datacenter's geographical location and dynamically manage the firewall protecting it. This approach will largely relieve cloud service providers from the responsibility of securing their datacenters.

Acknowledgments

We thank Sushil Jajodia for his highly useful suggestions, which helped us improve the algorithm for firewall composition. US National Science Foundation grant CNS-1347958 supported this research.

References

1. S. Jajodia et al., "Flexible Support for Multiple Access Control Policies," ACM Trans. Database Systems, vol. 26, no. 2, 2001, pp. 214–260.

2. I. Cervesato et al., "Relating Strands and Multiset Rewriting for Security Protocol Analysis," Proc. 13th Computer Security Foundations Workshop (CSFW 00), 2000, pp. 35–51.

3. F.J. Fabrega, J.C. Herzog, and J. Guttman, "Strand Spaces: Why Is a Security Protocol Correct?" Proc. IEEE Symp. Security and Privacy, 1998, pp. 160–171.

4. J. Loeckx and K. Sieber, The Foundations of Program Verification, John Wiley & Sons, 1987.

5. V. Kumar, Mobile Database Systems, John Wiley & Sons, 2006.

6. H. Seki, "Unfold/Fold Transformation of Stratified Programs," Theoretical Computer Science, vol. 86, no. 1, 1991, pp. 107–139.

7. Cisco IOS Lock and Key Security, white paper, Cisco Systems, 1996.

8. Y. Bartal et al., "Firmato: A Novel Firewall Management Toolkit," Proc. IEEE Symp. Security and Privacy, 1999, pp. 17–31.

9. A. Mayer, A. Wool, and E. Ziskind, "Fang: A Firewall Analysis Engine," Proc. IEEE Symp. Security and Privacy, 2000, pp. 177–187.

10. S. Ioannidis et al., "Implementing a Distributed Firewall," Proc. ACM Conf. Computer and Comm. Security, 2000, pp. 190–199.

11. E. Goren and O. Duskin, "Mobile Firewall," internal report, Check Point Software Technologies, Hebrew Univ.

12. M. Gondree and Z.N.J. Peterson, "Geolocation of Data in the Cloud," Proc. 3rd ACM Conf. Data and Application Security and Privacy, 2013.

CHETAN JAISWAL is a PhD scholar at the University of Missouri, Kansas City. His research interests include cloud computing; mobile, wireless sensor network, and cloud security; and cloud-based database transaction systems. He is also passionate about programming, learning new concepts, and teaching. Contact him at [email protected].

MAHESH NATH is a PhD scholar at the University of Missouri, Kansas City. His research interests include network and information security and privacy, with an emphasis on next-generation firewall frameworks. Contact him at [email protected].

VIJAY KUMAR is the Curator's Professor in the computer science department at the University of Missouri, Kansas City. His research interests include information security, wireless and mobile computing, and database systems, with particular emphasis on cybersecurity and wireless data dissemination. Kumar has a PhD in computer science from Southampton University, England. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


Multilabels-Based Scalable Access Control for Big Data Applications

Hongsong Chen, University of Science and Technology Beijing
Bharat Bhargava, Purdue University
Fu Zhongchuan, Harbin Institute of Technology

A multilabels-based access control model uses different labels to provide scalable granularity access protection to big data applications.

Big data refers to datasets that are too large for typical database software tools to capture, store, manage, and analyze.1 Forrester Research defines big data as "a set of skills, techniques, and technologies for handling data on an extreme scale with agility and affordability."2 In 2012, Gartner defined big data as "high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."3

Big data is used in many critical areas, such as online social networks and mining, healthcare information systems, physics, e-commerce, sensors and remote sensing, and


the life sciences. In 2012, the Obama administration announced the "Big Data Research and Development Initiative," which explored how big data could be used to address important problems faced by the government.4 Big data extends traditional data processing techniques in three dimensions: volume, velocity, and variety. Every day, data sources of all kinds generate 2.5 quintillion bytes of data. Time-sensitive data processes, such as online financial transaction systems, require real-time response. Data types and structures are variable in big data, which can include unstructured and uncertain data.5

Some big data applications could improve economic outcomes by correctly and efficiently applying big data and data processing techniques. For example, Forrester Research estimates that the potential value to the US healthcare information system of using big data effectively could be more than $300 billion per year.2 At the same time, however, research challenges are emerging from issues such as data heterogeneity, inconsistency, incompleteness, timeliness, security and privacy, visualization, and collaboration.6

Regardless of whether the concern is security threats to big data or security protections for big data, security and privacy issues must be solved efficiently and in a timely manner. To address these security and privacy challenges, we propose a multilabels-based access control model that provides flexible security protection to big data. Our scalable access control model uses labels to provide scalable granularity access protection to a big data application in the healthcare area.

Security and Privacy Challenges in Big Data Applications

Multiple data sources, multiple data formats, and multiple user types introduce new security challenges to access control models for big data applications. Sensitive data faces many threats, such as information leakage, unauthorized access, and tampering. Various methods are used to provide security and privacy protection for big data.

The Cloud Security Alliance (CSA) highlights the top 10 big data security and privacy challenges7: secure computation in distributed programming frameworks, security best practices for nonrelational data stores, privacy-preserving data mining and analytics, cryptographically enforced data-centric security, granular access control, secure data storage and transaction logs, granular audits, data provenance, end-point validation and filtering, and real-time security monitoring. We classify these challenges into four categories (see Figure 1): infrastructure security, data privacy, data management, and integrity and reactive security.

CSA explains each challenge from four viewpoints: use case, modeling, analysis, and implementation. The challenges differ from those of traditional data security because of the characteristics of big data applications. The top 10 challenges are interrelated and involve security and privacy problems in big data collection, transfer, storage, and processing. Granular access control to different data sources and entities is a foundational problem in these security challenges. Thus, we must reconsider the data access control model and redesign it to adapt to the variable access control requirements of big data applications.

FIGURE 1. Classification of the Cloud Security Alliance top 10 big data security and privacy challenges.7 The figure groups the challenges into four categories: infrastructure security (secure computations in distributed programming frameworks; security best practices for nonrelational data stores), data privacy (privacy-preserving data mining and analytics; cryptographically enforced data-centric security; granular access control), data management (secure data storage and transaction logs; granular audits; data provenance), and integrity and reactive security (end-point validation and filtering; real-time security monitoring). The four categories provide security and privacy protection at different levels. (©2013 Cloud Security Alliance. Used with permission.)


Big data applications should comply with relevant rules and regulations,7 such as the Health Insurance Portability and Accountability Act (HIPAA), Sarbanes-Oxley Act (SOX), Payment Card Industry Data Security Standard (PCI-DSS), ISO/IEC 27002,8 Federal Information Security Management Act (FISMA), and the EU Data Privacy Directive.

Consider, for example, Google Flu Trends (GFT), a big data application that's based on large numbers of Google search queries.9 GFT uses the IP address associated with each search query to determine where the query originated. Using this method, the application can use search queries to detect influenza epidemics. Because GFT handles many people's health status and Google recognizes the importance of people's privacy, none of the queries in the GFT database can be associated with a particular user. The GFT database doesn't retain information about user identity, IP address, or physical location. All of the project's data is used in accordance with Google's Privacy Policy. The GFT project can thus predict flu trends without violating people's privacy. To protect user privacy and its own business secrets, Google doesn't make its query data public. Therefore, other researchers can't use the data for research or for predicting flu trends.
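The aggregation idea behind this design (retain only regional counts, never a query-to-user mapping) can be sketched in a few lines. This is an illustrative toy, not Google's actual pipeline; the prefix-based geolocation and the flu term list are assumptions.

```python
def region_of(ip):
    # Stand-in for IP geolocation; a real system would consult a geo database.
    prefix_to_region = {"203.0": "region-a", "198.51": "region-b"}
    return prefix_to_region.get(".".join(ip.split(".")[:2]), "unknown")

def aggregate_flu_queries(events, flu_terms=("flu", "influenza")):
    """Count flu-related queries per region.

    Only the aggregate counts survive; the raw (ip, query) events,
    and hence user identity and location, are never stored.
    """
    counts = {}
    for ip, query in events:
        if any(term in query.lower() for term in flu_terms):
            region = region_of(ip)
            counts[region] = counts.get(region, 0) + 1
    return counts

events = [
    ("203.0.113.9", "flu symptoms"),
    ("203.0.113.20", "weather today"),
    ("198.51.100.7", "influenza vaccine"),
]
print(aggregate_flu_queries(events))  # {'region-a': 1, 'region-b': 1}
```

The output retains epidemiological signal (counts per region) while discarding everything that could identify a user, which is the property the article attributes to GFT.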

Given this analysis, the access control model is a basic and critical security protection that can affect big data integration, integrity, confidentiality, and availability. The problem of access control in big data is deciding how to select and control data access granularity according to the application's security requirements. See the sidebar for a discussion of related access control models.

Multilabels Structure in an HDFS Application

HIPAA requires personal health records (PHRs) to be stored securely. PHRs contain important and sensitive data for patients, doctors, health insurance companies, and healthcare institutions. As PHRs increase in volume and data type, they will aggregate to become an important source of big data. To protect patients' privacy and control access granularity to this data, we propose a multilabels-based scalable access control framework that protects sensitive data stored in the Hadoop Distributed File System. HDFS, which was originally built for the Apache Nutch Web search engine, is designed to store big data in a cloud computing environment. As Figure 2 shows, HDFS has a fault-tolerant storage policy and discretionary access control (DAC).10 However, DAC is insufficient for variable big data applications, especially for security- and privacy-sensitive applications. In our model, access control granularity varies with the number of multilabels and their content. The data owner can use the labels selectively, and the system administrator and designer can add, delete, or revise the labels according to the application's security requirements.

As Figure 2 shows, an HDFS cluster consists of a NameNode, which manages the file system's metadata, and DataNodes, which store the datasets. Clients

FIGURE 2. The Hadoop Distributed File System architecture.10 The NameNode holds the file system metadata (name, replicas, ...; for example, /home/foo/data, 3, ...), and DataNodes on each rack store the data blocks. Clients issue metadata operations to the NameNode and read and write blocks on the DataNodes, while the NameNode directs block operations and replication across racks. HDFS has a fault-tolerant storage policy and discretionary access control (DAC). (©2013 Apache Hadoop. Used with permission.)


ACCESS CONTROL MODELS FOR BIG DATA APPLICATIONS

Hadoop is an open source framework for big data storage and processing that is widely used at Facebook, Yahoo, eBay, LinkedIn, and in other scientific and big data applications. Several researchers have proposed access control models for these applications.

Chunming Rong and his colleagues propose an access control scheme with secure sharing storage based on Hadoop.1 Their scheme includes four phases: creating the access token, distributing the access token, gaining the access token, and accessing blocks. To access a data file, a user sends a request to the Hadoop client inquiring about the file's owner. The Hadoop client then sends a response to this user as well as the cloud storage provider according to the secure sharing storage rules over the cloud. The user downloads the re-encrypted token file and decrypts the access token data to obtain the access token. The user can then access the data stored in the Hadoop data nodes using the decryption metadata information. This access control scheme depends on secure sharing storage and metadata encryption/decryption. Because the metadata is encrypted, servicing a huge number of cloud users will affect the efficiency of the metadata encryption/decryption and key management.

Patricia Ortiz and Oscar Lázaro present a multidomain access control method that combines Extensible Access Control Markup Language (XACML) role-based access control models with SPARQL Protocol and RDF Query Language (SPARQL) query rewriting capabilities.2 They describe a mobile application scenario in which a user from domain A wants to access a resource from domain B. In their access control architecture, each domain contains a domain access server (DAS), which serves the access control systems. Because the architecture includes many domains, layers, and access control modules, it might be difficult to adapt it to big data applications.

Kan Yang and his colleagues propose enabling access control with dynamic policy updating for big data in the cloud. They developed an outsourced policy-updating model based on an adapted ciphertext-policy attribute-based encryption (CP-ABE) method.3 Their cloud storage system has multiple authorities. The system model consists of four entities: authorities (AA), a cloud server (server), data owners (owners), and data consumers (users). The security scheme includes five phases: system initialization, key generation, data encryption, data decryption, and policy updating. The authors introduce two types of access structures that are used in constructing attribute-based encryption (ABE) schemes: the linear secret-sharing scheme (LSSS) structure and the access tree structure. Although they propose a method for updating the policy, the overhead of this updating should be precisely evaluated when the data volume is huge.

Online social networks (OSNs) have become important sources of big data. In January 2013, Facebook released Graph Search,4 which lets users control their access policies using their relationships in traditional security models. Jun Pang and Yang Zhang claim that users can exploit public information to adjust access permissions in OSNs.4 They use a novel OSN model that includes both a user graph and a public information graph. They developed a hybrid logic that can express fine-grained access control policies based on user and public information. Nevertheless, getting users to trust and accept public information remains an important problem.

References

1. C. Rong, Z. Quan, and A. Chakravorty, "On Access Control Schemes for Hadoop Data Storage," Proc. IEEE Int'l Conf. Cloud Computing and Big Data (CloudCom-Asia 13), 2013, pp. 641–645.

2. P. Ortiz et al., "Enhanced Multi-domain Access Control for Secure Mobile Collaboration through Linked Data Cloud in Manufacturing," Proc. IEEE Symp. and Workshop World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2013, pp. 1–9.

3. K. Yang et al., "Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud," Proc. Infocom, 2014, pp. 2013–2021.

4. J. Pang and Y. Zhang, "A New Access Control Scheme for Facebook-style Social Networks," arXiv preprint arXiv:1304.2504, 2013.


connect to the NameNode to access file metadata and execute actual I/O operations on the DataNodes. Because Hadoop is open source, we can extend the metadata in the NameNode to store labels. QueryIO, a Hadoop-based big data query and analytics tool, provides manual and automated data tagging functions that let users define properties for files when the data is written to HDFS (see http://queryio.com/product/big-data-analysis.html). Data owners can therefore manage their big data easily. QueryIO provides data tagging and metadata extension services for structured and unstructured big data. It enables users to define additional metadata (data tags) to extend the metadata layer. Both HDFS and QueryIO are implemented in Java, so we can implement the multilabels access control framework using the Hadoop API, QueryIO interface, and Java API.
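The data-tagging idea (attaching user-defined properties to files at write time and querying over them later) can be modeled with a small in-memory sketch. The class and method names below are hypothetical illustrations of the concept, not the QueryIO or Hadoop API.

```python
class TaggedFileSystem:
    """Toy model of a file store whose per-file metadata is extended
    with user-defined tags, in the spirit of the data tagging
    described above."""

    def __init__(self):
        self._files = {}  # path -> {"data": bytes, "tags": dict}

    def write(self, path, data, **tags):
        # Tags are captured at write time, alongside the data.
        self._files[path] = {"data": data, "tags": dict(tags)}

    def tags(self, path):
        return self._files[path]["tags"]

    def find(self, **criteria):
        """Return paths whose tags match all the given criteria."""
        return [path for path, f in self._files.items()
                if all(f["tags"].get(k) == v for k, v in criteria.items())]

fs = TaggedFileSystem()
fs.write("/phr/exam1.pdf", b"...", data_type="PDF", security="high secret")
fs.write("/phr/intro.html", b"...", data_type="HTML", security="unclassified")
print(fs.find(security="high secret"))  # ['/phr/exam1.pdf']
```

The same pattern, implemented against the NameNode's extended metadata, is what lets label-based access checks run without touching the file contents.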

The PHR data storage system includes two main types of operations: write and read. When data owners want to write data to HDFS, they create multilabels and write them into the HDFS metadata associated with the PHR data. These multilabels adjust to the PHR's security needs, and can include the Hadoop original metadata, data type, security level, lifetime, number of replications, access policy, and hash value. Figure 3 shows the multilabel data structure.

In Figure 3, metadata represents the original Hadoop metadata; data type refers to the basic data type, such as PDF, Word, OpenOffice, XML, HTML, or image; security degree is the security and protection level; lifetime refers to how long the data exists in HDFS; replication number is related to data dependability; and access policy refers to the access control rules for the data. Access control rules can be DAC, Bell-LaPadula (BLP), Biba, role-based access control (RBAC), or attribute-based access control (ABAC). DAC is HDFS's default access control policy. BLP and Biba are mandatory access control (MAC) policies. BLP policy ensures data confidentiality and is characterized by the phrase "no read up, no write down," whereas Biba policy ensures data integrity and is characterized by the phrase "no write up, no read down." RBAC is widely used in organizations and enterprises.
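The two MAC rules just quoted reduce to simple comparisons on a lattice of levels. A minimal sketch, using an illustrative five-level ordering (the mapping itself is an assumption, not part of the article):

```python
# Higher number = higher security level (illustrative ordering).
LEVELS = {"unclassified": 0, "confidential": 1, "low secret": 2,
          "middle secret": 3, "high secret": 4}

def blp_permits(subject_level, object_level, op):
    """Bell-LaPadula (confidentiality): no read up, no write down."""
    s, o = LEVELS[subject_level], LEVELS[object_level]
    if op == "read":
        return s >= o   # may read at or below own level
    if op == "write":
        return s <= o   # may write at or above own level
    raise ValueError(op)

def biba_permits(subject_level, object_level, op):
    """Biba (integrity): no write up, no read down."""
    s, o = LEVELS[subject_level], LEVELS[object_level]
    if op == "read":
        return s <= o   # may read at or above own level
    if op == "write":
        return s >= o   # may write at or below own level
    raise ValueError(op)
```

Note the duality: each Biba check is the BLP check with the comparison flipped, which is why the two policies protect confidentiality and integrity respectively.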

The access policy can be unstructured data, such as a tree-based access policy. In our multilabel-based access control model, every label can be used to control data access granularity. Security degree and access policy can vary with data type. For example, we can assign an image from a health examination a high security degree, while setting hospital and doctor introductory information to a low security degree. We can set the lifetime label to one hour, one day, one month, or "permanent." Security and cost are related to the data lifetime. If the lifetime has expired, the data is deleted from HDFS. This model is similar to temporal attribute-based access control. We use the hash value to protect the multilabels' integrity.
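The label structure and its hash-based integrity protection can be sketched as follows. This is a minimal illustration, not the authors' Hadoop implementation; the field names are assumptions based on Figure 3.

```python
import hashlib
import json
import time

def make_multilabel(metadata, data_type, security_degree,
                    lifetime_seconds, replication, access_policy):
    """Build a multilabel record; the hash value covers all other
    labels so tampering with any of them becomes detectable."""
    label = {
        "metadata": metadata,
        "data_type": data_type,
        "security_degree": security_degree,
        "expires_at": time.time() + lifetime_seconds,
        "replication": replication,
        "access_policy": access_policy,
    }
    payload = json.dumps(label, sort_keys=True).encode()
    label["hash"] = hashlib.sha1(payload).hexdigest()
    return label

def verify_multilabel(label):
    """Recompute the hash over every field except the hash itself."""
    body = {k: v for k, v in label.items() if k != "hash"}
    payload = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest() == label["hash"]

lbl = make_multilabel({"path": "/phr/exam.pdf", "replicas": 3},
                      "PDF", "high secret", 3600 * 24 * 180, 3, "BLP")
assert verify_multilabel(lbl)
lbl["security_degree"] = "unclassified"   # tampering with a label...
assert not verify_multilabel(lbl)         # ...is detected
```

One caveat worth noting: a plain hash only detects accidental or naive tampering, since an attacker who can rewrite the labels can also recompute the hash; a keyed hash (HMAC) held by the security agent would close that gap.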

After the data owner sets the multilabels, the security access control scheme will protect them by preventing attackers from tampering with them or the data. The multilabel access control concept is similar to the active bundle concept.11 However, the multilabels-based access control model differs from active bundles because it doesn't include virtual machines. Therefore, this model will provide greater security, while extending the active bundles approach's metadata through multilabels and applying it in the Hadoop big data application. For big data applications that don't deal with PHR storage, the multilabel number and content can differ. For example, a security administrator can add a risk label to set the security risk threshold. When a user tries to access the data, the security agent evaluates the access risk using a security evaluation algorithm. If the agent determines that the risk value is less than the risk threshold, it permits the access; otherwise, it denies access. In this way, the multilabels access control model is scalable and configurable for different big data applications.
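The risk-label check described above might look like the following sketch; the scoring attributes and weights are purely illustrative stand-ins for a real security evaluation algorithm.

```python
def risk_score(request):
    """Toy risk evaluation: each suspicious attribute of the access
    request adds risk. The attributes and weights are illustrative,
    not from the article."""
    score = 0.0
    if not request.get("known_device", False):
        score += 0.4
    if request.get("hour", 12) not in range(8, 18):
        score += 0.3   # access outside working hours
    if request.get("failed_attempts", 0) > 2:
        score += 0.5
    return score

def permit(request, risk_threshold):
    """Grant access only when the evaluated risk stays below the
    threshold stored in the (hypothetical) risk label."""
    return risk_score(request) < risk_threshold

print(permit({"known_device": True, "hour": 10}, 0.5))   # True
print(permit({"known_device": False, "hour": 2}, 0.5))   # False
```

Because the threshold lives in a label, the administrator can tighten or relax it per dataset without changing the evaluation code, which is what makes the scheme configurable.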

Personal Health Record Big Data Storage Application

Because the PHR data storage application requires privacy protection, we use it to demonstrate the multilabels-based scalable access control model. Depending on the security requirements, we can tag PHRs with different data type labels (such as medical examination image, medical record PDF document, or patient information XML) and set the security degree label to high secret, middle secret, low secret, confidential, or unclassified. This concept is similar to service-level agreements (SLAs), which can be used in cloud computing and big data. Because healthcare information uses different data types, every data type has a different security requirement and can be related to the security degree.

FIGURE 3. Structure of multilabels (fields: metadata, meta type, security degree, lifetime, replication number, access policy, hash value). Multilabels can include the Hadoop original metadata, data type, security level, lifetime, number of replications, access policy, and hash value.

For example, we could set an XML file that includes a patient's social security number (SSN) to high secret; PDF files containing medical examination, symptom, and prescription information to secret; an HTML file containing an introduction to the doctor to unclassified; and the Word file with the patient's medical reports to middle secret. The lifetime should be set to a different effective period for each data type and security requirement. For example, we can set a medical examination report's effective period to six months, a patient's name to five years, a doctor's name to 10 years, and an SSN to 20 years. The number of replications can range from 1 to 5. As this number increases, data dependability improves, but the user's cost also increases, because the user will need additional storage space.

We compute the PHR hash value using the SHA-1 hash algorithm.

Every entity in a healthcare information system can be granted a security privilege level, such as high, medium, common, or low, depending on the security degree label in the HDFS metadata. The entity's security privilege level is relative to its role in the healthcare information system. For example, a government health institute can be set to high privilege; patients and doctors can be set to medium privilege; a health insurance company can be set to low privilege; and a pharmacy can be set to common privilege. In a MAC model, different privilege entities have different access permissions to the same data.

When creating PHR data, the data owner can select and set labels, such as data type, security degree, lifetime, access control policy, and number of replications. The security agent then computes the hash value using the SHA-256 algorithm, and all the labels are attached to the PHR data's HDFS metadata. Thus, the multilabels access control approach combines active bundle, RBAC, ABAC, DAC, and MAC, providing flexible and scalable access control policies for different data and entities. Before the PHR data is written into the HDFS system, the PHR data owner should be authenticated by the Kerberos protocol.

When an entity wants to read or write data in HDFS, it should be authenticated by the Kerberos protocol. The entity is then granted a security privilege level according to its role in the healthcare information system. The Hadoop client will check the multilabels in the HDFS metadata that it accesses. If the entity's attributes and the PHR multilabels meet the access control rules, the entity is granted access to the PHR data; otherwise, it's denied access. QueryIO is especially well suited to enabling users to process unstructured big data. It gives structure to and supports querying of big data applications, and it makes the metadata extension in HDFS feasible. The output writer and input reader classes should be reconstructed to realize the multilabels' writing and reading functions. QueryIO can help extract structured information from unstructured data and construct a data type label.
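Putting the pieces together, the read-path check (authenticate, map role to privilege, then compare the entity's privilege against the security degree and lifetime labels) can be sketched as follows. The privilege-to-degree mapping is an illustrative assumption, and the Kerberos and hash-verification steps are stubbed to booleans.

```python
PRIVILEGE = {"low": 0, "common": 1, "medium": 2, "high": 3}
SECURITY = {"unclassified": 0, "confidential": 1, "low secret": 2,
            "middle secret": 3, "high secret": 4}

def check_access(entity, label, now):
    """Sketch of the read-path check: the entity must be
    authenticated, the label must be unexpired, and the entity's
    privilege must suffice for the label's security degree."""
    if not entity.get("authenticated"):
        return False                      # Kerberos step failed
    if now > label["expires_at"]:
        return False                      # lifetime label expired
    required = SECURITY[label["security_degree"]]
    held = PRIVILEGE[entity["privilege"]]
    # Illustrative mapping: each privilege step covers roughly one
    # security step above "confidential".
    return held + 1 >= required

doctor = {"authenticated": True, "privilege": "medium"}
label = {"security_degree": "middle secret", "expires_at": 2_000_000_000}
print(check_access(doctor, label, now=1_700_000_000))  # True
```

Under this mapping a medium-privilege doctor can read a middle-secret record, while a common-privilege pharmacy cannot, matching the role assignments suggested earlier.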

In the near future, we will use a software engineering method to realize the multilabels scalable access control model for the PHR healthcare system using the Hadoop open source software and the QueryIO software tool.

Acknowledgments

The Beijing Natural Science Foundation (no. 4142034), Beijing Science and Technique Plan Project (no. D141100003414002), China Scholarship Council Foundation, Beijing Higher Education Young Elite Teacher Project (YETP0380), Fundamental Research Funds for the Central Universities (FRF-TP-14-042A2), and Chinese National 863 research project (no. 2013AA01A209) supported this work.

References

1. J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity, report, McKinsey Global Inst., 2011.

2. J. Kobielus et al., Enterprise Hadoop: The Emerging Core of Big Data, technology report, Forrester Research, Oct. 2011.

3. A. Beyer and D. Laney, The Importance of "Big Data": A Definition, report, Gartner, 2012.

4. T. Kalil, "Big Data is a Big Deal," blog, 29 Mar. 2012; www.whitehouse.gov/blog/2012/03/29/big-data-big-deal.

5. Top Tips for Securing Big Data Environments, e-book, IBM, 2012.

6. H.V. Jagadish et al., "Big Data and Its Technical Challenges," Comm. ACM, vol. 57, no. 7, 2014, pp. 86–94.

7. Expanded Top Ten Big Data Security and Privacy Challenges, tech. report, Big Data Working Group, CSA Research, Apr. 2013; https://cloudsecurityalliance.org/download/expanded-top-ten-big-data-security-and-privacy-challenges.

8. ISO/IEC 27002, Information Technology—Security Techniques—Code of Practice for Information Security Management, Int'l Organization for Standardization (ISO) and Int'l Electrotechnical Commission (IEC), 2013.

9. J. Ginsberg et al., "Detecting Influenza Epidemics Using Search Engine Query Data," Nature, vol. 457, no. 7232, 2008, pp. 1012–1014.

10. D. Borthakur, HDFS Architecture Guide, Hadoop, 2013; http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

11. R.M. Salih, L. Lilien, and L.B. Othmane, "Protecting Patients' Electronic Health Records Using Enhanced Active Bundles," Proc. 6th Int'l Conf. Pervasive Computing Technologies for Healthcare, 2012, pp. 1–4.

HONGSONG CHEN is an associate professor of computer science at the University of Science and Technology Beijing (USTB) and a visiting scholar in the Department of Computer Science at Purdue University. His research interests include cloud computing, cloud security, big data applications, wireless network security, and trust computing. Chen received a PhD in computer science from the Harbin Institute of Technology. He is a member of the China Computer Federation. Contact him at [email protected] or [email protected].

BHARAT BHARGAVA is a professor of computer science at Purdue University. His research interests include mobile wireless networks, secure routing and dealing with malicious hosts, providing security in service-oriented architectures (SOA), adapting to attacks, and experimental studies. Bhargava received a PhD in electrical engineering from Purdue University. He is a fellow of the IEEE Computer Society. Contact him at [email protected].

FU ZHONGCHUAN is an associate professor of computer science and technology at the Harbin Institute of Technology. His research interests include trust computing, information security, multicore computing, cloud computing, and fault-tolerant computing. Fu received a PhD in computer science and technology from the Harbin Institute of Technology. Contact him at [email protected].



OVER THE LAST DECADE, TWO TECHNOLOGY TRENDS HAVE BEEN CHANGING HOW ENTERPRISES THINK ABOUT INFRASTRUCTURE, DATA, AND ANALYTICS: BIG DATA AND CLOUD COMPUTING. Here, we'll look at the intersection of these two trends.

Big Data

Big data represents a new paradigm of data management (collection, processing, querying, data types, and scale) that isn't well served by traditional data management systems. Two distinct paradigms are emerging in the big data space: working with data at rest, and working with streams of data in flight. We'll focus on data at rest for now.

The big data ecosystem has seen some fast evolution. Most big data systems today incorporate Hadoop-based architectures (http://hadoop.apache.org) and are quickly becoming the center of the enterprise technology stack for data management. These architectures usually consist of several components: Hadoop Distributed File System (HDFS), MapReduce, YARN, and HBase, to name a few. For the purpose of this article, we'll collectively refer to these as Hadoop. Terms like data lake and data hub refer to HDFS being the central storage system due to the scale and economics it has to offer, enabling storage of data in full fidelity for long periods of time.

Cloud

Cloud computing refers to a paradigm for infrastructure, platform, and software consumption in which users consume from a shared pool of resources that someone else manages. Users pay for what they use. There are public cloud environments, such as Amazon Web Services (AWS), Google, and Microsoft Azure, as well as software offerings, such as OpenStack and VMware, that you can use to build your own private cloud. We'll limit the discussion to public cloud for now.

We can divide cloud computing technologies into three levels: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). These service levels aren't new, but technology has evolved to make the consumption patterns look different from how they looked in the past. In the early 2000s, AWS extended the paradigm of end users interacting with and consuming a service programmatically, without any human involvement.2 Other vendors, such as Microsoft, Google, and IBM, have since forayed into this business as well.

Intersection of the Two Worlds

The worlds of big data and cloud computing (mostly IaaS) share some characteristics that make the intersection intuitive in some ways and counterintuitive in others.

Motivation and Considerations

There are several motivations for using cloud environments for big data deployments, as well as some considerations.

Bringing Big Data Systems to the Cloud

Amandeep Khurana, Cloudera

72 IEEE Cloud Computing, Published by the IEEE Computer Society, 2325-6095/14/$31.00 © 2014 IEEE

EDITOR: Eli Collins, Cloudera, [email protected]

WHAT’S TRENDING?


Cost. Total cost of ownership of infrastructure includes hardware, power, racks, hosting space, and the people managing the infrastructure. Public cloud benefits from economies of scale, and vendors often pass these benefits to the customers, who can simply consume the infrastructure without worrying about the operational costs.

Ease of use. Cloud computing is all about accessing resources programmatically and automating systems as much as possible. That’s not possible with bare-metal hardware, and ease of use is a big factor when considering deploying in cloud environments.

Elasticity. Big data workloads are often spiky in nature. Users onboard new data sources and need to perform ad hoc processing to explore the datasets. This requires the ability to scale up the environment and perhaps scale it down later. With bare-metal infrastructure, you'll have to provision for that burst requirement, or you'll have to wait for the IT team to provision new hardware. In cloud environments, you can scale up and down programmatically in a matter of minutes.

Operations. In public cloud environments, operations is the cloud provider's responsibility. Users don't have to worry about operating the infrastructure. If the system fails, they can recover by provisioning more resources.

Reliability. Some might argue that public cloud infrastructures are less reliable than bare-metal ones because virtual machines have a higher chance of going down than physical servers. The flip side is that you can provision a new virtual machine much faster than you can procure and provision a new server. With that, reliability comes down to how you architect your system for fault tolerance.

Flexibility. Clouds offer different kinds of infrastructure configurations with minimal customization options. With bare-metal infrastructure, you can customize at the time of procurement. Having said that, most enterprises have standard infrastructure configurations that they use, and customization is uncommon.

Performance. Virtualization has a performance hit, especially for I/O-intensive workloads. This hit has rapidly decreased in recent times. For certain workloads, it might not be acceptable. For others, where slight variation and possibly lower performance are acceptable, cloud environments might be sufficient.

Security and compliance. Security and compliance are important considerations for enterprise deployments. We could probably write several dedicated articles to cover all aspects. The key is that both cloud environments and Hadoop have been rapidly developing and have come a long way toward catering to the various requirements.

Location. Often, users want to keep their data close to where it's generated. This could be because of the volume of data, where it's accessed from, or restrictions on where it can be moved. For example, certain kinds of data generated in China can't be transferred outside the country. Public cloud environments offer the flexibility of having deployments in multiple locations without needing your own datacenters.

Intersection in Practice

Let’s look at how the intersection of the two para-digms exists today and where future opportunities exist.

Consumption paradigms. Two kinds of consumption paradigms exist for big data systems in public cloud environments.

In a hosted system, the vendor hosts the infrastructure on which big data software is deployed. Examples of this are enterprises deploying their own software in AWS, Azure, and Google.

In a managed and hosted system, the vendor hosts, operates, and manages the big data deployment and infrastructure for you. This could entail anything from provisioning to debugging the environment when things fail. Examples include Amazon Elastic MapReduce, Qubole Data Service, and Altiscale.

Architectural considerations. The key architectural consideration in this intersection is the choice of persistent storage.


Public cloud environments offer storage substrates such as AWS Simple Storage Service (S3) and Azure blob store. These are stores where you can store binary large objects (blobs) of data. They have a simple API—get, put, and delete, for the most part. Public cloud environments are built on the premise that these stores are where data is stored for high durability and reliability. Virtual machines come with local storage, but it’s ephemeral and lives only for the lifespan of the virtual machine. Other storage options such as Elastic Block Store (EBS) and database services such as DynamoDB and Redshift treat S3 as their backup store, which enables them to guarantee durability. The public cloud world revolves around the blob stores.
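The get/put/delete semantics can be pictured as a flat key-to-blob mapping. The following minimal in-memory sketch illustrates only the interface shape; the class and method names are illustrative, not any provider’s actual SDK:

```python
class BlobStore:
    """Minimal in-memory sketch of a blob store's get/put/delete semantics."""

    def __init__(self):
        self._blobs = {}  # key (string) -> blob (bytes)

    def put(self, key, blob):
        self._blobs[key] = blob  # last write wins; no partial updates

    def get(self, key):
        return self._blobs[key]

    def delete(self, key):
        del self._blobs[key]

store = BlobStore()
store.put("sensors/patient-42.dat", b"\x00\x01\x02")
assert store.get("sensors/patient-42.dat") == b"\x00\x01\x02"
store.delete("sensors/patient-42.dat")
```

The point of the simple interface is that structure (tables, indexes, locality) is left to the systems layered on top, which is exactly where the blob-store and HDFS worlds diverge.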

For Hadoop, the world revolves around HDFS. All the processing and serving frameworks are tightly integrated with HDFS and leverage the semantics that HDFS has to offer. For example, MapReduce leverages data locality information to optimize task scheduling for minimal network usage. HBase has made architectural decisions that enable it to leverage HDFS replication for fault tolerance as well as the I/O characteristics of HDFS.

This fundamental difference in storage approach defines how the two worlds integrate today and the opportunities going forward. Storing in HDFS allows for optimizations that can leverage data locality information for things such as task scheduling. Storing in a substrate like S3 allows storage to be independent of compute, offering flexibility in resource management by using virtual machines to do the computations when the workload demands. This difference in approach leads to different deployment options.

Deployment paradigms. There are two kinds of big data deployment paradigms in the public cloud.

The first includes clusters that use a blob store as their primary storage substrate. These can be transient in nature, where clusters are spun up to execute a workflow and die once the workflow has completed. You could also have clusters that stay on beyond a workflow and can be used for running more workflows later. The key here is that the workflow’s source and destination is a blob store such as S3 or Azure blob storage. In this deployment paradigm, data locality is traded for flexibility in resource management. Virtual machines can be thought of as execution containers for tasks to run in.

The second deployment paradigm includes clusters that use HDFS as their primary storage substrate. These are usually persistent clusters where data is stored in HDFS. Blob stores can be used for periodic backups or as staging areas from which datasets are brought into HDFS for further usage and long-term storage. The workloads run against data in HDFS, not the blob store. Clusters are usually long running, and virtual machines are considered persistent entities that store data as well as perform computation. They are the equivalent of servers instead of just containers where computation is performed.

Use Cases

The different deployment paradigms are suitable for different kinds of workloads.

Ad Hoc Batch Workloads

In an ad hoc batch workload, you have datasets stored somewhere, or you’ve brought in a new dataset and want to perform some processing on it to cleanse, enrich, or transform it, or perhaps perform some aggregations to explore it. Tools of choice for expressing the processing include MapReduce, Hive, Pig, Crunch, Cascading, and Spark. These frameworks can read and write to the blob store or can work with datasets persisting in HDFS.

This type of workload can be modeled in a transient cluster or a persistent cluster, with storage being the blob store or HDFS, which makes it a good match for public cloud environments.

Batch Workloads with SLAs

Batch workloads with strict SLAs are usually extract, transform, load (ETL) jobs or report generation that are triggered based on schedule or data availability and have an SLA attached to them. This kind of workload is automated and needs higher predictability in performance and execution time.

This type of workload can also be modeled in a transient or a persistent cluster, with storage being the blob store or HDFS, which makes it a good match for public cloud environments.


Ad Hoc Interactive Workloads

Ad hoc interactive workloads consist of interactive, fast querying using tools such as Impala, Presto, and, to some extent, Spark. This usually involves a user querying the dataset with response times on the order of seconds.

This type of workload is better modeled with persistent clusters with HDFS as the primary storage substrate. Tools such as Impala, Presto, and Spark integrate with HDFS and leverage data locality (and high throughput from local storage) to provide fast response times during queries. Transient clusters aren’t a good idea here because the time taken to spin up a cluster will outweigh the time taken to run the query. These can be deployed in public clouds and would treat virtual machines as servers and use blob stores for backups.

Interactive Workloads with SLAs

Interactive workloads with SLAs consist of using frameworks such as HBase, Solr, and Impala to drive applications that users interact with, where the response times have SLAs.

This type of workload must be deployed on persistent clusters with HDFS as the primary storage substrate, and oftentimes not colocated with any other workloads. These can be deployed in public clouds and would treat virtual machines as servers and use blob stores for backups.

AS YOU CAN SEE, big data systems can use different deployment paradigms based on the workloads and access patterns they cater to. Opportunities for tighter integration will enable big data systems to leverage public cloud environments more effectively. As both public cloud and big data systems see more adoption and new usage paradigms evolve, we’ll see features and enhancements on both sides to make the intersection of the two worlds broader and more mature.

Future articles will dive deeper into some of the topics that this article touched on.


AMANDEEP KHURANA is a principal solutions architect at Cloudera. His research interests include large-scale distributed systems, storage systems, and data-oriented products. Khurana has an MS in computer science from the University of California, Santa Cruz. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


The transformation of IT through cloud computing is accelerating as a wide range of organizations adopt this new approach for deploying a variety of applications. However, security concerns prevent many organizations from deploying certain types of applications in the cloud. They worry both about attacks on data being sent over the Internet to and from the cloud, and about whether their applications and data are more vulnerable to attack in a cloud than in their own internal computing resources.

In many cases, the use of good security engineering techniques can reduce both the risks and the fears.1,2 Another good approach is to exploit robust, high-level platform-as-a-service (PaaS) components that are already deployed and managed by a cloud provider, rather than to design and build applications from the lower base offered by infrastructure as a service (IaaS). Examples include the SQL database server offered as a service on Microsoft Azure (http://azure.microsoft.com), Amazon’s Simple Storage Service (S3, http://aws.amazon.com/s3), and domain-specific high-level platforms such as Force.com (http://force.com).

However, organizations still refuse to move some applications—such as those that deal with company-sensitive and medical data—to the cloud because of the perceived risk.

Rise of the Private Cloud

The obvious solution is to deploy these security-sensitive applications on the organization’s internal computer infrastructure, which is often (rightly or wrongly) perceived to be more secure than the public cloud. However, these private clouds have some important limitations. Although scalability in a public cloud is effectively infinite for most applications because of the sheer magnitude of the resources available,3 private clouds are limited by the size of the organization’s internal IT. Further, not all the scalable, platform-level services that are offered on public clouds are available in the private cloud. For example, Amazon offers a range of scalable data storage solutions for structured and unstructured data, not all of which are available for deployment on private clouds.

To choose where to deploy an application, many system managers therefore ask, “Does my organization consider any part of this application too risky to deploy on a public cloud?” If they answer yes, they must deploy the application on a private cloud, with

BLUE SKIES

Application Security through Federated Clouds

Paul Watson, Newcastle University

Editor: Rajiv Ranjan, Commonwealth Scientific and Industrial Research Organization, Australia

76 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


its inherent restrictions. If not, the public cloud is an option.

Unfortunately, this is a wasted opportunity for many applications that could benefit from the scalability and agility of public clouds but include some sensitive data. Consider the typical, though simplified, healthcare sensor analysis workflow shown in Figure 1.

This application takes in data about a patient—a mixture of sensitive medical data that identifies the patient and heart-rate sensor data. It analyzes the sensor data and generates a summary that can then be stored with the rest of the patient data. Because there’s some sensitive data in the workflow, an organization might feel that it has no choice but to store and analyze the data in a secure private cloud. This is unfortunate, because the analysis is often computationally intensive, which makes it ideal for exploiting the scalability of the public cloud, particularly if, at peak times, data from many patients is arriving for analysis. Not only healthcare applications suffer from this problem. We see equivalent security issues limiting the uptake of the cloud for applications in domains such as finance, human resources, and government.

Federated Clouds

An alternative is to exploit the best features of each: public clouds’ scalability and agility and private clouds’ security. In the case of the healthcare workflow in Figure 1, a federated cloud (also known as a hybrid cloud) approach would store the confidential medical data on a private cloud and send the sensor data (tagged with an anonymized ID) to the public cloud for analysis, with the results returned to the private cloud to be combined with the confidential data.

For a simple workflow such as that in Figure 1, it isn’t too difficult to work out a way to partition the software to meet security requirements. Larger, more complex applications might have many more components, and even different security levels for different types of data. Making manual decisions on how to partition the application is error prone and potentially fraught with danger, as it could result in sensitive data being stored and processed in the public cloud.

As a result, some researchers have devised methods that can automatically determine ways to partition applications over federated clouds to meet security requirements. One approach,4,5 inspired by traditional multilevel security approaches,6 models an application as a set of communicating distributed components and uses rules to generate a set of inequalities that represent the security requirements.

This model uses the notation in Table 1 to define the rules in Figure 2. Applying these rules to the workflow in Figure 1 gives us the resulting lattice of inequalities shown in Figure 3 (an arrow from a to b indicates that a ≥ b).

We can then substitute any known values for the variables and simplify the set of inequalities. For the running example, if we use only two security levels, 0 (low) and 1 (high), we can set the level of the patient’s medical data to 1 and that of the rest to 0. Similarly, the security level of the service that reads the medical data needs to be 1, whereas the rest can be 0. Substituting these values into the inequalities and simplifying produces:

l(p0) ≥ 1 ∧ l(p1) ≥ 1 ∧ l(p2) ≥ 0 ∧ l(p3) ≥ 0 ∧ l(n0–1) ≥ 1 ∧ l(n1–2) ≥ 0 ∧ l(n2–3) ≥ 0.

If we have a private cloud at security level 1 and a public cloud at level 0, and we assume that the networks within clouds have security level 1, the method can automatically determine that there are four possible ways to partition the workflow over public and private clouds to fulfill these security conditions. Figure 4 shows the four valid options. A well-defined method such as this lets us build tools to automate option generation and to deploy the application.
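With only two clouds and two levels, the valid deployments can be found by brute force. The following sketch encodes the running example’s assumptions (platform minimums from the simplified inequalities; networks within a cloud at level 1, the inter-cloud network at level 0); the names are our own illustration, not the article’s actual tooling:

```python
from itertools import product

PRIVATE, PUBLIC = 1, 0  # security levels of the two clouds

# Minimum levels from the simplified inequalities for the healthcare workflow:
platform_min = [1, 1, 0, 0]                      # l(p0)>=1, l(p1)>=1, l(p2)>=0, l(p3)>=0
network_min = {(0, 1): 1, (1, 2): 0, (2, 3): 0}  # l(n0-1)>=1, l(n1-2)>=0, l(n2-3)>=0

def network_level(cloud_a, cloud_b):
    # Networks within a cloud are level 1; the inter-cloud network is level 0.
    return 1 if cloud_a == cloud_b else 0

valid = [placement
         for placement in product([PRIVATE, PUBLIC], repeat=4)  # cloud for p0..p3
         if all(placement[i] >= m for i, m in enumerate(platform_min))
         and all(network_level(placement[i], placement[j]) >= m
                 for (i, j), m in network_min.items())]

print(len(valid))  # 4: s0 and s1 are pinned to the private cloud; s2 and s3 are free
```

The enumeration recovers the four options of the text: the read and anonymize services must stay private, while the analyze and write services can each go either way.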

Researchers are also exploring other approaches for modeling and analyzing the security of applications partitioned over federated clouds. Proposals include using Petri nets to model information flows7 and optimizing the placement of partitions on clouds to meet quality-of-service (QoS) requirements.8 It’s also possible to extend the model to systems with external components, such as mobile devices and the Internet of Things.5

FIGURE 1. A healthcare sensor analysis workflow: Read patient data (s0) → Anonymize (s1) → Analyze (s2) → Write results (s3). To protect sensitive data, an organization might feel compelled to perform analysis of the data on a private cloud, although it could realize significant benefit from exploiting the computational resources available in the public cloud.


Policy-Based Partitioning of Applications across Clouds

Security is only one of a set of criteria that might be used to make decisions about partitioning an application running over a set of clouds. Others include performance, price, and reliability. This encourages the adoption of an architecture such as that in Figure 5, in which a policy manager takes a high-level description of an application and partitions it based on a user-specified policy. It could, for example, include deadlines and price limits as well as security requirements. The policy manager could be simple and static, or built on a more sophisticated system of dynamic, managed service agreements between the application owner and the cloud providers, such as Arjuna Technologies’ Agility framework (www.arjuna.com/agility).

Employing a policy manager to deploy the components of a distributed system wouldn’t have been possible 10 years ago, when software deployment was largely manual and therefore static. However, with the rise of virtualization, we can structure distributed applications as a set of virtual machines (VMs) or other containers (for example, www.docker.com) that can be dynamically deployed. Figure 5 shows another viable approach: running a portable, domain-specific platform (in this case, e-Science Central, which supports scientific workflows9) on each cloud to enable the dynamic deployment of application components.

When multiple options for meeting the requirements exist (as with the healthcare example), there is the problem of how to choose between them. A solution is to introduce a cost model that can be applied to each solution, allowing them to be ranked. A simple pricing-based cost model, for example, could allow users to choose between all options that meet their security requirements.4 This model requires other

FIGURE 3. The security lattice created by applying the rules of Figure 2 to the healthcare workflow of Figure 1.

Table 1. Lexical conventions for the rules in Figure 2.

Notation  | Meaning
si        | Service i
pi        | Platform i
ni–j      | Network connecting platform i to platform j
di.x–j.y  | Data sent from service i port x to service j port y
l(z)      | Security level of z
c(z)      | Clearance of z (the maximum level at which z may operate)

FIGURE 2. Security rules to create a security lattice representing a distributed application.

For each service si, add the inequality l(pi) ≥ l(si) (the security level of the platform on which the service is deployed must be greater than or equal to that of the service).

For each data connection di.x–j.y, add l(pi) ≥ l(di.x–j.y) and l(pj) ≥ l(di.x–j.y) (the security level of the platforms on which the services transmitting and receiving the data are deployed must be greater than or equal to that of the data), and l(ni–j) ≥ l(di.x–j.y) (the security level of the network across which the data is transmitted must be greater than or equal to that of the data).

To add data security as in Bell-LaPadula: for each service, add c(si) ≥ l(si); for each data connection, add c(sj) ≥ l(di.x–j.y) and l(di.x–j.y) ≥ l(si) (the Bell-LaPadula “no read up” and “no write down” rules).
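Because these rules are mechanical, they can be applied programmatically. The following Python sketch is our own illustration of the rules (not the authors’ tooling); it emits one inequality string per rule for a linear workflow, using the article’s naming conventions:

```python
def build_lattice(services, connections):
    """Apply the Figure 2 rules, emitting one inequality per rule.

    services: list of service names; service i runs on platform p_i.
    connections: list of (i, j) pairs meaning data d_{i-j} flows from s_i to s_j.
    """
    ineqs = []
    for i, _ in enumerate(services):
        ineqs.append(f"l(p{i}) >= l(s{i})")     # platform clears its service
        ineqs.append(f"c(s{i}) >= l(s{i})")     # Bell-LaPadula clearance rule
    for i, j in connections:
        d = f"d{i}-{j}"
        ineqs.append(f"l(p{i}) >= l({d})")      # sending platform clears the data
        ineqs.append(f"l(p{j}) >= l({d})")      # receiving platform clears the data
        ineqs.append(f"l(n{i}-{j}) >= l({d})")  # network clears the data
        ineqs.append(f"c(s{j}) >= l({d})")      # no read up
        ineqs.append(f"l({d}) >= l(s{i})")      # no write down
    return ineqs

# The four-service healthcare workflow of Figure 1:
lattice = build_lattice(["read", "anonymize", "analyze", "write"],
                        [(0, 1), (1, 2), (2, 3)])
print(len(lattice))  # 4 services x 2 rules + 3 connections x 5 rules = 23
```

Feeding known levels into such a generated system and simplifying is then a routine constraint-solving step.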


inputs, such as estimates of the cost of executing services on a cloud, which is an interesting area of performance modeling research.10,11

Making Dynamic Decisions

The description thus far suggests a one-way flow from application description, through the policy manager, to deployment. But today’s IT is highly dynamic. Consider, for example, a smartphone app that periodically sends sensitive data to a cloud for storage and analysis. The data owner might feel less concerned when the phone is connected over his or her company’s corporate Wi-Fi than when it’s communicating over a coffee shop’s Wi-Fi.

For this reason, the diagram in Figure 5 also shows information flowing back to the policy manager from the clouds. Before a mobile client roams to a new network, it could send information to the policy manager about that network’s security (an unknown network could default to the lowest security level). The policy manager could then rerun the security analysis method described earlier. If the results show that the new network’s security level doesn’t satisfy the set of inequalities, the client could shut down or switch to a local-only mode of operation in which it caches data on the phone until it can reach a more secure network.
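In the simplest case, the roaming check reduces to re-evaluating the one inequality the new network affects. A minimal sketch of that decision, with illustrative function and label names (not from any real policy manager):

```python
def on_network_change(new_network_level, required_level):
    """Re-run the security check when the client roams to a new network.

    An unknown network should be passed in as level 0 (the lowest level).
    Returns "send" if the inequality still holds, else "cache-locally".
    """
    if new_network_level >= required_level:
        return "send"           # inequality satisfied: keep uploading
    return "cache-locally"      # violated: hold data until a secure network

assert on_network_change(1, 1) == "send"           # e.g., corporate Wi-Fi
assert on_network_change(0, 1) == "cache-locally"  # e.g., coffee-shop Wi-Fi
```

A full policy manager would re-check the whole inequality set and possibly regenerate the deployment, but the per-link decision has exactly this shape.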

Similarly, an exception could be raised if a cloud failure occurs, triggering the policy manager to generate another deployment option for the application, assuming one exists. For the application workflow in Figure 1, for example, if the public cloud fails, the workflow can still be executed entirely on the private cloud. An optimization is not to restart a computation from the beginning, but to reuse any intermediate results that have already been computed and are still accessible.12

Federated clouds offer a solution to the security problems preventing some applications from exploiting the cloud’s benefits. The major research challenge is to find rigorous, auditable methods for dynamically partitioning applications over public and private clouds to meet security and other nonfunctional requirements. This avoids the need for manual, ad hoc methods that are prone to errors that could have serious consequences.

FIGURE 4. Four valid partitioning options for the workflow of Figure 1 (Read patient data, Anonymize, Analyze, Write results). In a federated cloud, workflows can be partitioned over private (outlined in red) and public clouds (outlined in green) to meet an application’s security requirements.

FIGURE 5. Policy-based partitioning of an application over federated clouds. A policy manager takes the application plus its security, dependability, performance, and cost requirements and deploys components across Azure, OpenShift, and a private cloud. In this example, a domain-specific platform (e-Science Central) runs on each cloud, enabling dynamic deployment of application components.


Acknowledgments

The Research Councils UK “Social Inclusion through the Digital Economy” project EP/G066019/1 funded this research.

References

1. R. Anderson, Security Engineering, John Wiley & Sons, 2008.

2. E.G. Amoroso, “Practical Methods for Securing the Cloud,” IEEE Cloud Computing, vol. 1, no. 1, 2014, pp. 28–38.

3. M. Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing, tech. report UCB/EECS-2009-28, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, Feb. 2009.

4. P. Watson, “A Multi-Level Security Model for Partitioning Workflows over Federated Clouds,” J. Cloud Computing, vol. 1, no. 1, 2012, pp. 1–15.

5. P. Watson and M. Little, “Multilevel Security for Deploying Distributed Applications on Clouds, Devices and Things,” to be presented at the IEEE Int’l Conf. Cloud Computing Technology and Science (CloudCom 14), 2014.

6. D.E. Bell and L.J. LaPadula, Secure Computer Systems: Mathematical Foundations, tech. report, Mitre, 1973.

7. W. Zeng et al., “A Flow Sensitive Security Model for Cloud Computing Systems,” CoRR abs/1404.7760, Computing Research Repository, 2014.

8. E. Goettelmann, W. Fdhila, and C. Godart, “Partitioning and Cloud Deployment of Composite Web Services under Security Constraints,” Proc. IEEE Int’l Conf. Cloud Eng. (IC2E 13), 2013, pp. 193–200.

9. P. Watson, H. Hiden, and S. Woodman, “e-Science Central for CARMEN: Science as a Service,” Concurrency and Computation: Practice and Experience, vol. 22, no. 17, 2010, pp. 2369–2380.

10. J. Taheri et al., “Pareto Frontier for Job Execution and Data Transfer Time in Hybrid Clouds,” Future Generation Computer Systems, vol. 37, July 2014, pp. 321–334.

11. H. Hiden, S. Woodman, and P. Watson, “A Framework for Dynamically Generating Predictive Models of Workflow Execution,” Proc. 8th Workshop Workflows in Support of Large-Scale Science, 2013, pp. 77–87.

12. Z. Wen and P. Watson, “Dynamic Exception Handling for Partitioned Workflow on Federated Clouds,” Proc. IEEE 5th Int’l Conf. Cloud Computing Technology and Science (CloudCom 13), vol. 1, 2013, pp. 198–205.

PAUL WATSON is a professor of computer science and director of the Digital Institute at Newcastle University, UK. He also directs the UK’s Social Inclusion through the Digital Economy Hub. His research interests include scalable information management, with a current focus on cloud computing. Watson received a PhD in parallel functional programming from Manchester University. He is a Chartered Engineer and a Fellow of the British Computer Society. He received the 2014 Jim Gray eScience Award for his work on Clouds for Science. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


DAVID BERNSTEIN, Cloud Strategy Partners, [email protected]

CLOUD TIDBITS

WELCOME TO CLOUD TIDBITS! In each issue, I’ll look at a different “tidbit” of technology that I consider unique or eye-catching, and of particular interest to the IEEE Cloud Computing readers.

Today’s tidbit focuses on container technology and how it’s emerging as an important part of the cloud computing infrastructure.

Cloud Computing’s Multiple OS Capability

Many formal definitions of cloud computing exist. The National Institute of Standards and Technology’s internationally accepted definition calls for “resource pooling,” where the “provider’s computing resources are pooled to serve multiple consumers using a multitenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.”1 It also calls for “rapid elasticity,” where “capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand.”

Most agree that the definition implies some kind of technology that provides an isolation and multitenancy layer, where computing resources are split up and dynamically shared using an operating technique that implements the specified multitenant model. Two technologies are commonly used here: the hypervisor and the container. You might be familiar with how a hypervisor provides for virtual machines (VMs). You might be less familiar with containers, the most common of which rely on Linux kernel containment features, more commonly known as LXC (https://linuxcontainers.org). Both technologies support isolation and multitenancy.

Not all agree that a hypervisor or container is required to call a given system a cloud; several specialized service providers offer what is generally called a bare metal cloud, where they apply the referenced elasticity and automation to the rapid provisioning and assignment of physical servers, eliminating the overhead of a hypervisor or container altogether. Although interesting for the most demanding applications, the somewhat oxymoronic term “bare metal cloud” is something Tidbits will perhaps look at in more detail in a later column.

Thus, we’re left with the working definition that cloud computing, at its core, has hypervisors or containers as a fundamental technology.

Cloud Systems with Hypervisors and Containers

Most commercial cloud computing systems—both services and cloud operating system software products—use hypervisors. Enterprise VMware installations, which can rightly be called early private clouds, use the ESXi hypervisor (www.vmware.com/products/esxi-and-esx/overview). Some public clouds (Terremark, Savvis, and Bluelock, for example) use ESXi as well. Both Rackspace and Amazon Web Services (AWS) use the Xen hypervisor (www.xenproject.org/developers/teams/hypervisor.html), which gained tremendous popularity because of its early open source inclusion with Linux. Because Linux has now shifted to support KVM (www.linux-kvm.org), another open source

Containers and Cloud: From LXC to Docker to Kubernetes


alternative, KVM has found its way into more recently constructed clouds (such as AT&T, HP, Comcast, and Orange). KVM is also a favorite hypervisor of the OpenStack project and is used in most OpenStack distributions (such as RedHat, Cloudscaling, Piston, and Nebula). Of course, Microsoft uses its Hyper-V hypervisor underneath both Microsoft Azure and Microsoft Private Cloud (www.microsoft.com/en-us/server-cloud/solutions/virtualization.aspx).

However, not all well-known public clouds use hypervisors. Google, IBM/SoftLayer, and Joyent are all examples of extremely successful public cloud platforms using containers, not VMs.

Some trace inspiration for containers back to the Unix chroot command, which was introduced as part of Unix version 7 in 1979. In 1998, an extended version of chroot was implemented in FreeBSD and called jail. In 2004, the capability was improved and released with Solaris 10 as zones. By Solaris 11, a full-blown capability based on zones was completed and called containers. By that time, other proprietary Unix vendors offered similar capabilities—for example, HP-UX containers and IBM AIX workload partitions.

As Linux emerged as the dominant open platform, replacing these earlier variations, the technology found its way into the standard distribution in the form of LXC.

Figure 1 compares application deployment using a hypervisor and a container. As the figure shows, the hypervisor-based deployment is ideal when applications on the same cloud require different operating systems or OS versions (for example, RHEL Linux, Debian Linux, Ubuntu Linux, Windows 2000, Windows 2008, Windows 2012). The abstraction must be at the VM level to provide this capability of running different OS versions.

With containers, applications share an OS (and, where appropriate, binaries and libraries), and as a result these deployments will be significantly smaller in size than hypervisor deployments, making it possible to store hundreds of containers on a physical host (versus a strictly limited number of VMs). Because containers use the host OS, restarting a container doesn’t mean restarting or rebooting the OS.

Those familiar with Linux implementations know that there's a great degree of binary application portability among Linux variants, with libraries occasionally required to complete the portability. Therefore, it's practical to have one container package that will run on almost all Linux-based clouds.

Docker Containers

Docker (www.docker.com) is an open source project providing a systematic way to automate the faster deployment of Linux applications inside portable containers. Basically, Docker extends LXC with a kernel- and application-level API that together run processes in isolation: CPU, memory, I/O, network, and so on. Docker also uses namespaces to completely isolate an application's view of the underlying operating environment, including process trees, network, user IDs, and file systems.

Docker containers are created using base images. A Docker image can include just the OS fundamentals, or it can consist of a sophisticated prebuilt application stack ready for launch. When building images with Docker, each action taken (that is, command executed, such as apt-get install) forms a new layer on top of the previous one. Commands can be executed manually or automatically using Dockerfiles.
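This layering behavior can be modeled with a toy sketch (an illustrative analogy only; Docker actually uses a union/copy-on-write filesystem such as AUFS, not dictionaries). A file lookup sees the topmost copy of each path, and a new build step simply stacks one more layer while the layers below are reused. Python's `collections.ChainMap` captures the flavor:

```python
from collections import ChainMap

# Toy model of image layers: each build step contributes one dict ("layer").
base   = {"/etc/os-release": "Ubuntu 14.04", "/bin/sh": "dash"}  # base image
layer1 = {"/usr/bin/python": "2.7"}   # e.g., layer from `RUN apt-get install python`
layer2 = {"/opt/app/app.py": "v1"}    # e.g., layer from `COPY app.py /opt/app/`

# A lookup searches the layers top-down; together they form the image.
image = ChainMap(layer2, layer1, base)
print(image["/etc/os-release"])       # resolved from the base layer: Ubuntu 14.04

# A further build step stacks one more layer that shadows older copies.
patched = image.new_child({"/opt/app/app.py": "v2"})
print(patched["/opt/app/app.py"])     # prints v2; the layers below are unchanged
```

The paths and versions here are invented for illustration; the point is only that later layers shadow earlier ones without copying them.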

Figure 1. Comparison of (a) hypervisor and (b) container-based deployments. A hypervisor-based deployment is ideal when applications on the same cloud require different operating systems or different OS versions; in container-based systems, applications share an operating system, so these deployments can be significantly smaller in size.


SEPTEMBER 2014, IEEE CLOUD COMPUTING

Each Dockerfile is a script composed of various commands (instructions) and arguments listed successively to automatically perform actions on a base image to create (or form) a new image. They're used to organize deployment artifacts and simplify the deployment process from start to finish.
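As a sketch, a minimal Dockerfile might look like the following (the package, file names, and tag are hypothetical; each instruction produces a new image layer when built with `docker build`):

```dockerfile
# Hypothetical Dockerfile; build with: docker build -t myapp .
# Each instruction below forms a new image layer on top of the previous one.

# start from a base image
FROM ubuntu:14.04

# layer containing the installed packages
RUN apt-get update && apt-get install -y python

# layer containing the application file
COPY app.py /opt/app/app.py

# image metadata: the default process to launch in the container
CMD ["python", "/opt/app/app.py"]
```

Because each instruction is a layer, unchanged early steps (such as the package install) can be reused from cache on subsequent builds.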

Containers can run on VMs too. If a cloud has the right native container runtime (such as some of the clouds mentioned), a container can run directly on the VM. If the cloud only supports hypervisor-based VMs, there's no problem—the entire application, container, and OS stack can be placed on a VM and run just like any other application-and-OS stack.

Abstractions on Top of VMs and Containers

Both VMs and containers provide a rather low-level construct. Basically, both present an operating system interface to the developer. In the case of the VM, it's a complete implementation of the OS; you can run any OS that runs on the bare metal. The container gives you a "view" or a "slice" of an OS already running. You access OS constructs as if you were running an application directly on the OS. Developers often build on this level of abstraction to provide more application runtime constructs, so users don't feel like they're running on a bare machine or a bare OS, but on an application runtime of some kind.

Virtual appliances, such as VirtualBox (www.virtualbox.org), RightScale Appliance,2 and Bitnami (https://bitnami.com), provide application runtime environments that shield the application from the bare OS by providing an interface for applications with higher-level, more portable constructs. Virtual appliances gained popularity with equipment manufacturers who wanted to provide a vehicle for distributing software versions of an appliance—for example, a network load balancer, WAN optimizer, or firewall. Virtual appliances can run on top of a VM or a container (native LXC-based or running on top of a VM).

For even more isolation from the OS, especially desired by application programmers, application runtimes can be reconfigured into total platform-as-a-service (PaaS) runtimes. Readers will remember that last issue I discussed Cloud Foundry PaaS and mentioned that it uses container technology for deployment. This is precisely why: the distribution can be targeted precisely for the container engine and Linux OS on the cloud, and like the virtual appliance, it can also run on top of a VM.

As Figure 2 shows, there are many possible layering combinations, depending on the OS's capabilities, the deployment/portability strategy, and whether a PaaS is used.

How does one choose? As mentioned earlier, the virtual appliance approach is a favorite vehicle used by network equipment manufacturers to create a portable software appliance.

Those who want to deploy applications with the least infrastructure will choose the simple container-to-OS approach. This is why container-based cloud vendors can claim improved performance compared to hypervisor-based clouds. A recent benchmark of a "fast data" NewSQL system claimed that in an apples-to-apples comparison, running on IBM SoftLayer using containers resulted in a fivefold performance improvement over the same benchmark running on Amazon AWS using a hypervisor.3

Software developers tend to prefer using PaaS, which will use a container if available for its runtime, to maximize performance as well as to manage application clustering. If not, the PaaS will run a container on a VM. Consequently, as PaaS gains in popularity, so do containers.

However, using containers for security isolation might not be a good idea. In an August 2013 blog,4 one of Docker's engineers expressed optimism that containers would eventually catch up to VMs from a security standpoint. But in a presentation given in January 2014,5 the same engineer said that the only way to have real isolation with Docker is to either run one Docker per host, or one Docker per VM. If high security is needed, it might be worth sacrificing the performance of a pure-container deployment by introducing a VM to obtain more tried-and-true isolation. As with any other technology, you need to know the deployment's security requirements and make appropriate decisions.

Figure 2. Possible layering combinations for application runtimes.


CLOUD TIDBITS

Open Source Cluster Manager for Docker Containers

As mentioned earlier, one of containers' nicest features is that they can be managed specifically for application clustering, especially when used in a PaaS environment. Answering this need, at the June 2014 Google Developer Forum, Google announced Kubernetes, an open source cluster manager for Docker containers.6 According to Google, "Kubernetes is the decoupling of application containers from the details of the systems on which they run. Google Cloud Platform provides a homogenous set of raw resources . . . to Kubernetes, and in turn, Kubernetes schedules containers to use those resources. This decoupling simplifies application development since users only ask for abstract resources like cores and memory, and it also simplifies data center operations."

Google goes on to describe network-centric deployment improvements in Kubernetes: "While running individual containers is sufficient for some use cases, the real power of containers comes from implementing distributed systems, and to do this you need a network. However, you don't just need any network. Containers provide end users with an abstraction that makes each container a self-contained unit of computation. Traditionally, one place where this has broken down is networking, where containers are exposed on the network via the shared host machine's address. In Kubernetes, we've taken an alternative approach: that each group of containers (called a Pod) deserves its own, unique IP address that's reachable from any other Pod in the cluster, whether they're co-located on the same physical machine or not."
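The Pod abstraction can be sketched in a manifest like the following (an illustrative example: the names are invented, and the exact schema has varied across Kubernetes versions; the earliest releases used v1beta APIs). It also shows the "abstract resources like cores and memory" from Google's description:

```yaml
# Illustrative Pod manifest (hypothetical names; schema varies by version).
# All containers in this Pod share one cluster-unique IP address.
apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    resources:
      requests:          # abstract resources the scheduler matches to machines
        cpu: "1"
        memory: 512Mi
```

The user never names a machine; the cluster manager picks one that can satisfy the requested cores and memory, which is exactly the decoupling Google describes.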

Industry Movement around Kubernetes

Shortly after Google's announcements, several players endorsed Kubernetes—and therefore Docker and containers—as a core cloud deployment technology.7

In addition to a host of start-ups (such as CoreOS, Mesosphere, and SaltStack), Kubernetes supporters include:

• Google (for Google Compute Engine, GCE),
• Microsoft (for Microsoft Azure),
• VMware,
• IBM (for SoftLayer and OpenStack), and
• Red Hat (for its OpenStack distribution).

Although HP, Canonical, AWS, and Rackspace are "Docker friendly," they haven't explicitly endorsed Kubernetes. Industry speculation is that once a more neutral governance/collaboration structure is put together around Docker (a start-up company) and Kubernetes (still controlled by Google), organizations will agree on a common packaging and deployment approach—and practically everyone is already thinking about it. I'm not aware of any other cloud project with this level of alignment on anything!

CONTAINERS, DOCKER, AND KUBERNETES SEEM TO HAVE SPARKED THE HOPE OF A UNIVERSAL CLOUD APPLICATION AND DEPLOYMENT TECHNOLOGY. And that, my friends, qualifies them to be this issue's Cloud Tidbit. I hope you enjoyed it!

References

1. P. Mell and T. Grance, The NIST Definition of Cloud Computing: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145, 2011.

2. U. Thakrar, "Introducing RightScale Cloud Appliance for vSphere," blog, 10 Dec. 2013; www.rightscale.com/blog/enterprise-cloud-strategies/introducing-rightscale-cloud-appliance-vsphere.

3. B. Kepes, "VoltDB Puts the Boot into Amazon Web Services, Claims IBM Is Five Times Faster," Forbes, 6 Aug. 2014; www.forbes.com/sites/benkepes/2014/08/06/voltdb-puts-the-boot-into-amazon-web-services-claims-ibm-5-faster.

4. J. Petazzoni, "Containers & Docker: How Secure Are They?" blog, 21 Aug. 2013; http://blog.docker.com/2013/08/containers-docker-how-secure-are-they.

5. J. Petazzoni, "Linux Containers (LXC), Docker, and Security," 31 Jan. 2014; www.slideshare.net/jpetazzo/linux-containers-lxc-docker-and-security.

6. C. McLuckie, "Containers, VMs, Kubernetes and VMware," blog, 25 Aug. 2014; http://googlecloudplatform.blogspot.com/2014/08/containers-vms-kubernetes-and-vmware.html.

7. B. Butler, "Containers: Buzzword du Jour, or Game-Changing Technology?" NetworkWorld, 3 Sept. 2014; www.networkworld.com/article/2601925/cloud-computing/container-party-vmware-microsoft-cisco-and-red-hat-all-get-in-on-app-hoopla.html.

DAVID BERNSTEIN is the managing director of Cloud Strategy Partners, co-founder of the IEEE Cloud Computing Initiative, founding chair of the IEEE P2302 Working Group, and originator and chief architect of the IEEE Intercloud Testbed Project. His research interests include cloud computing, distributed systems, and converged communications. Bernstein was a University of California Regents Scholar and holds BS degrees with highest honors in both mathematics and physics. Contact him at [email protected].
