Data and Computation Interoperability in Internet Services

Sergey Boldyrev, Dmitry Kolesnikov, Ora Lassila, Ian Oliver

Nokia Corporation

Espoo, Finland

e-mail: {sergey.boldyrev, dmitry.kolesnikov, ora.lassila, ian.oliver}@nokia.com

Abstract— The rapid shift of technology towards cloud computing requires a next generation of distributed systems that span seamlessly across the heterogeneous environments of information providers, cloud infrastructures, and terminal equipment. A framework for interoperability is required: detailed methods and a set of supporting tools for developing an end-to-end architecture in which aspects of data and computation semantics, scalability, security and privacy are resolved and leveraged, in order to provide an end-to-end foundation for such systems. Our objectives are to define the scope of the framework and to identify technology choices from the perspective of scalable technology and business models with high elasticity.

Keywords— service; semantic web; data; ontology; computation; privacy; latency; cloud computing; interoperability.

I. INTRODUCTION

The ICT industry is undergoing a major shift towards cloud computing. From the technological standpoint, this entails incorporating both “in-house” and outsourced storage and computing infrastructures, integrating multiple partners and services, and providing users a seamless experience across multiple platforms and devices. On the business side, there is an anticipation of improved flexibility, including the flexibility of introducing new, multisided business models where value can be created through interactions with multiple (different types of) players. Many companies are trying to answer questions like “Who might find this information valuable?”, “What would happen if we provided our service free of charge?”, and “What if my competitor(s) did so?”. The answers to such questions highlight opportunities for disruption and identify vulnerabilities [13]. Thus, by slicing apart and investigating the technological components, we seek answers through further analysis and collaboration with industrial researchers and the academic community.

In a world of fully deployed and available cloud infrastructures we can anticipate the emergence of distributed systems that can seamlessly span the “information spaces” of multiple hardware, software, information and infrastructure providers. Such systems can take advantage of finely granular and accountable processing services, can integrate heterogeneous information sources by negotiating their associated semantics, and can ultimately free users of the mundane challenges and details of technology usage (such as connectivity, the locus of data or computation, etc.). We foresee greater reuse of information and computational tasks.

Inputs to this work include the work of Marco Rodriguez on the RDF virtual machine, Haskell state monads, and various papers on out-of-order execution, e.g., [5], [6]. The objective was to realize, and to describe semantically, a scenario in which computation can be represented in a machine-understandable way and sent to another execution environment where, using the semantics, it can be reconstructed and computed. The targets of these computations are not low-level machine instructions as in Rodriguez’s work, nor simply functions, but a particular kind of function that can be executed or chained with other functions of the same type, just as monadic values can be chained with monadic operators such as “bind”.

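As a minimal illustration of the kind of chaining meant here (a sketch using Haskell’s standard Control.Monad.State library, not code from the system described), consider two steps composed with the monadic operators:

    import Control.Monad.State

    -- Each step is not a machine instruction but a value of a common
    -- monadic type, so steps compose with the monadic operators.
    step1, step2 :: State Int ()
    step1 = modify (+ 1)   -- increment the state
    step2 = modify (* 2)   -- double the state

    -- "bind" (>>=) chains the steps; the chain is itself a first-class
    -- value that could in principle be described, transmitted, and
    -- reconstructed in another execution environment.
    pipeline :: State Int ()
    pipeline = step1 >>= \_ -> step2   -- equivalently: step1 >> step2

    main :: IO ()
    main = print (execState pipeline 3)   -- (3 + 1) * 2 == 8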

Another important objective of the semantic representation of computation in this work was to realize a system in which computations originated by a certain HW/SW combination can be executed independently of the executing operating system or hardware architecture. This new computation paradigm requires explicit techniques to facilitate computation management within distributed environments. Non-functional aspects such as privacy, security and performance need to be controlled using behavior analysis as the system sustains load. Design and best practices for performance optimization within network and software components have been widely studied by multiple projects [9], [10], [11], [12]. The targets of computation management are not low-level optimizations of protocol or software-stack behavior, but accountable mechanisms enabling the orchestration of computations across the boundaries of environments.

In order for us to understand the opportunities and, perhaps more importantly, the challenges of such a future, we must establish a clear framework for interoperability. Aspects of data and computation semantics, performance and scalability of computation and networking, as well as threats to security and privacy, must be discussed and elaborated. This paper aims to be the beginning of that discussion.

We begin in Section II with an informal definition of data semantics via a set of declared dependencies and constraints used to reflect data-processing rules. Section III presents computations in the form of accountable components that can be leveraged and reused in larger, distributed systems. Section IV discusses the role of system behavior analysis in achieving an adequate level of system accountability. Finally, Section V discusses privacy enforcement, by which ownership, granularity and content semantics are managed.

II. DATA

A. About Semantics

In the broader sense, in the context of data, the term semantics is understood as the “meaning” of data. In this document, we take a narrow, more pragmatic view, and informally define semantics to be the set of declared dependencies and constraints that define how data is to be processed. In a contemporary software system, some part of the system (some piece of executable software) always defines the semantics of any particular data item. In a very practical sense, the “meaning” of a data item arises from one of two sources:

1) The relationships this item has with other data items and/or definitions (including the semantics of the relationships themselves). In this case, software “interprets” what we know about the data item (e.g., the data schema or ontology). An elementary example of this is object-oriented polymorphism, where the runtime system can pick the right software to process an object because the class of the object happens to be a subclass of a “known” class; more elaborate cases include ontology-based reasoning that infers new things about an object.

2) Some software (executing in the system runtime) that “knows” how to process this data item directly (including the system runtime itself, as this typically “grounds” the semantics of primitive datatypes, etc.). In this case, the semantics is “hard-wired” in the software.

All software systems have semantics, whether the semantics is defined implicitly or explicitly. As for data specifically, the semantics are typically defined explicitly, in the form of datatype definitions, class definitions, ontology definitions, etc. (these fall under category #1 above; the semantics of those things that fall under category #2 are typically either implicit or, in some cases, explicit in the form of external documentation).
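A small sketch of category #1, rendered here with Haskell type classes standing in for the subclassing of the object-oriented example (the types Person and Invoice are hypothetical, chosen only for illustration):

    -- The semantics of "how to process" comes from a declared relation
    -- between a type and a class, not from the processing code itself.
    class Describable a where
      describe :: a -> String

    newtype Person  = Person  String
    newtype Invoice = Invoice Double

    instance Describable Person  where describe (Person n)  = "person: "  ++ n
    instance Describable Invoice where describe (Invoice v) = "invoice: " ++ show v

    -- process knows nothing about Person or Invoice; the runtime picks
    -- the right code from the declared instance relation.
    process :: Describable a => a -> String
    process = describe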


B. Constituent Technologies of “Data Infrastructure”

From a practical standpoint, all applications need to query, transmit, store and manipulate data. All these functions have dependencies on technologies (or categories of technologies), which we shall collectively call the Data Infrastructure. The constituent technologies themselves have various dependencies, illustrated (in a simplified form) in Figure 1. From the data semantics standpoint, we here focus mostly on the technologies inside the box labeled “Data Modeling Framework”.

Figure 1. Dependencies in “Data Infrastructure” (diagram: an Application, with its Queries and Internal Data, depends on a Data Modeling Framework comprising the Domain Models, Object Model, Schema Language, Query Language and Serialization, which in turn connects to Transmission and to a Storage Infrastructure with its Query Engine and Storage)

Domain Models: Application semantics is largely defined via a set of domain models which establish the necessary vocabulary of the (real-world) domain of an application. Meaningful interoperability between two systems typically requires the interpretation of such models.

Schema Language: Domain models are expressed using some schema language. Typically such languages establish a vocabulary for discussing types (classes), their structure and their relationships with other types. Examples of schema languages include SQL DDL, W3C’s OWL and XML Schema, as well as Microsoft’s CSDL. Class definitions in object-oriented programming languages (e.g., Java) can also be considered schema languages.

Query Language: An application’s queries are expressed in some query language; this language may either be explicit (such as SQL or SPARQL) or implicit (in the sense that what effectively are queries are embedded in a programming language). Queries are written using vocabularies established by a query language (the “mechanics” of a query) and relevant domain models.

Serialization: For the purposes of transmission (and sometimes also for storage) a “syntactic”, external form of data has to be established. A standardized serialization syntax allows interoperating systems to exchange data in an implementation-neutral fashion. Serialization syntaxes do not have to be dependent on domain models.¹

Object Model: Ultimately, there are some definitions about the structure common to all data; we refer to these definitions as an object model (in some literature and in other contexts, the term metamodel is also used). Essentially, an object model is the definition of “what all data looks like” and thus forms the basis for a schema language, a query language and a serialization syntax. Examples of object models include W3C’s XML DOM and RDF.²
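To make the layering concrete, the following sketch (our illustration, not any standard API) fixes an RDF-like triple structure as the object model, and defines a serialization and a query form purely against it, independent of any domain model:

    type IRI    = String
    data Node   = Ref IRI | Lit String   deriving (Eq, Show)
    data Triple = Triple IRI IRI Node    deriving (Eq, Show)

    -- A serialization syntax defined against the object model alone.
    serialize :: Triple -> String
    serialize (Triple s p (Ref o)) = "<" ++ s ++ "> <" ++ p ++ "> <" ++ o ++ "> ."
    serialize (Triple s p (Lit o)) = "<" ++ s ++ "> <" ++ p ++ "> \"" ++ o ++ "\" ."

    -- A query likewise phrased against the object model: all objects
    -- reachable via a given predicate.
    objectsOf :: IRI -> [Triple] -> [Node]
    objectsOf p ts = [ o | Triple _ p' o <- ts, p' == p ]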

C. On Metadata and Provenance

Interoperable exchange of data can often benefit from metadata to describe the data being interchanged. Typically, such exchange occurs in use cases involving management of data that originates in multiple sources (and may, independently of sources, have multiple owners) and/or data that is subject to one or more policies (security, privacy, etc.).

¹ In practice, developers often have a hard time separating conceptual views of data, domain models, and the like, from concrete, syntactic manifestations of data. This has led to all kinds of unnecessary dependencies in the implementations of large systems. Consider yourselves warned.

² We are referring to RDF’s graph-based meta-model, not RDF Schema, which should be considered a schema language.


Much of the meta-information we aspire to record about data is in various ways associated with data provenance. Over the years, much work has been invested in provenance; [2] provides a general overview of these efforts in the database field. In cases where a multi-cloud solution is used to implement workflows that process data, we may also be interested in solutions for workflow provenance.

In practice, incorporating all or even any meta-information into one’s (domain) data schema or ontology is difficult, and easily results in a very complicated data model that is hard to understand and maintain. It can, in fact, be observed that provenance-related metadata attributes (such as source and owner) are cross-cutting aspects of the associated data, and by and large are independent of the domain models in question. The problem is similar to the problem of cross-cutting aspects in software system design, a situation³ that has led to the development of Aspect-Oriented Programming [1].

³ The motivating situation is sometimes referred to as the “tyranny of dominant decomposition” [3].

From the pragmatic standpoint, the granularity of data described by specific metadata is an important consideration: do we record metadata at the level of individual objects (in which case “mixing” metadata with the domain schema is achievable in practice), or even at a finer (i.e., “intra-object”) granularity? A possible solution to the latter is presented in [4].
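One way to keep such cross-cutting attributes out of the domain schema, sketched here as our own illustration rather than the mechanism of [4], is to wrap domain values in an orthogonal metadata layer:

    -- Provenance lives beside the domain value, not inside its schema.
    data Provenance = Provenance { source :: String, owner :: String }
      deriving Show

    data Tracked a = Tracked Provenance a
      deriving Show

    -- Domain logic passes through untouched; only metadata-aware code
    -- inspects the wrapper. Object-level granularity is what this
    -- particular sketch provides.
    instance Functor Tracked where
      fmap f (Tracked p x) = Tracked p (f x)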

III. COMPUTATION

From the computation perspective we are faced with the task of enabling and creating consistent (from the semantics standpoint) and accountable components that can be leveraged and reused in a larger, distributed information system. This translates to several pragmatic objectives, including (but not limited to) the following:


• Design and implementation of scenarios where computation can be described and represented in a system-wide understandable way, enabling computation to be “migrated” (transmitted) from one computational environment to another. This transmission would involve reconstruction of the computational task at the receiving end to match the defined semantics of the task and data involved. It should be noted that in scenarios like this the semantics of computation are described at a fairly high, abstract level (rather than, say, in terms of low-level machine instructions).

• Construction of a system where larger computational tasks can be decomposed into smaller tasks or units of computation, independent of the eventual execution environment(s) of these tasks (that is, independent of hardware platforms or operating systems).

Traditional approaches to “mobile code” have included various ways to standardize the runtime execution environment for code, ranging from hardware architectures and operating systems to various types of virtual machines. In a heterogeneous configuration of multiple device and cloud environments, these approaches may or may not be feasible, and we may want to opt for a very high-level (possibly purely declarative) description of computation (e.g., describing computation as a set of goals to be satisfied).

High-level, declarative descriptions of computation (e.g., in functional programming languages such as Erlang) afford considerable latitude for the target execution environment to decide how computation is to be organized and effected. This way, computation can be scheduled optimally with respect to various constraints of the execution environment (latency budget, power or other resource consumption, remaining battery level, etc.), and decisions can also be made to migrate computation to other environments if needed.
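A sketch of such a declarative description, under our own illustrative names: the task is plain data, so it can be serialized, transmitted and reconstructed, while the receiving environment grounds the named functions against its local registry and schedules execution as it sees fit:

    -- A computation described as data rather than as executable code.
    data Task
      = Fetch String        -- obtain input identified by a URI
      | Apply String Task   -- apply a named function to a subtask's result
      deriving (Show, Read) -- Show/Read stand in for a real wire format

    -- The executor supplies the semantics of the names locally.
    run :: (String -> String -> String) -> Task -> String
    run _       (Fetch uri)   = "contents of " ++ uri
    run resolve (Apply f sub) = resolve f (run resolve sub)

    -- Receiving side, e.g.: run localRegistry (read wireText :: Task)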


IV. MANAGED CLOUD PERFORMANCE

The newly outlined paradigm of distributed systems requires explicit orchestration of computation across the Edge, Cloud and Core, depending on component states and resource demand, through the proper exploitation of granular and therefore accountable mechanisms. Behavior analysis of the software components, using live telemetry gathered as the cloud sustains production load, is mandatory.

We see end-to-end latency as a major metric for quality assessment of a software architecture and its underlying infrastructure. It yields both the user-oriented SLA and the objectives for each Cloud stage. The ultimate goal is the ability to quantitatively evaluate and trade off software quality attributes to ensure competitive end-to-end latency of Internet Services.

On the other hand, the task of performance analysis is to design systems as cost-effectively as possible with a predefined grade of service, given the future traffic demand and the capacity of the system elements. Furthermore, its task is to specify methods for verifying that the actual grade of service fulfills the requirements, and to specify emergency actions for when systems are overloaded or technical faults occur.
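This grade-of-service reasoning is classical queueing theory (cf. Kleinrock [12]). As the simplest worked illustration (a standard textbook result, not a model from this paper), an M/M/1 server with arrival rate \lambda and service rate \mu has mean response time

    T \;=\; \frac{1}{\mu - \lambda}, \qquad \rho \;=\; \frac{\lambda}{\mu} \;<\; 1,

so that, for example, \mu = 110 requests/s against \lambda = 100 requests/s yields T = 100 ms; capacity decisions thus translate directly into an end-to-end latency budget.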

When applying Managed Cloud Performance in practice, a series of decision problems concerning both short-term and long-term arrangements appears:

• provision of granular latency control for diverse data and computational loads in hybrid Cloud architectures;

• resolution of short-term capacity decisions, including for example the determination of the optimal configuration, the number of live servers, or the migration of computation;

• resolution of long-term capacity decisions, such as decisions concerning the development and extension of the data and service architecture, the choice of technology, etc.;

• control of SLAs at different levels, e.g., consumer services, platform components, etc.

If the performance metrics of a solution are inadequate, the business is impacted. It has been reported by multiple sources [11] that 100 ms of latency costs Amazon 1% in sales, and that Goldman Sachs profits from a 500 ms trading advantage; Goldman Sachs used Erlang for its high-frequency trading programs.

The role of software behavioral analysis is therefore crucial in order to achieve a proper level of accountability of the whole system, constructed out of a number of granular components of Data and Computation, thus creating a solid foundation for the Privacy requirements.

V. PRIVACY IN THE CLOUD

The scope of the privacy enforcement being proposed here is to provide mechanisms by which users can finely tune the ownership, granularity, content, semantics and visibility of their data towards other parties. We should note that security is very closely linked with privacy; security, however, tends to be about access control, that is, whether an agent has access to a particular piece of data, whereas privacy is about constraints on the usage of data. Because we have not had the technological means to limit usage, we typically cast privacy as an access control problem; this precludes us from implementing some use cases that would be perfectly legitimate, but because the agent cannot access the relevant data, we cannot even discuss what usage is legal and what is not. In other words, security controls whether a particular party has access to a particular item of data, while privacy controls whether that data can or will be used for a particular purpose.
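The distinction can be stated in code. In this sketch (all types and names are our illustrative assumptions), security is a predicate over agent and item, while privacy adds the purpose:

    data Purpose  = Billing | Advertising | Research deriving (Eq, Show)
    data Agent    = Agent    { agentId :: String }
    data DataItem = DataItem { acl :: [String], allowedUses :: [Purpose] }

    -- Security: may this agent access the item at all?
    canAccess :: Agent -> DataItem -> Bool
    canAccess a d = agentId a `elem` acl d

    -- Privacy: even given access, may the item serve this purpose?
    canUse :: Agent -> DataItem -> Purpose -> Bool
    canUse a d p = canAccess a d && p `elem` allowedUses d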

As Nokia builds its infrastructure for data, the placement and scope of security, identity and analysis of that data become imperative. We can think of the placement of privacy as described in Figure 2 (note that in this diagram, by “ontology” we really mean semantics).

Given any data infrastructure, we require at minimum a security model to support the integrity of the data, an identity model to denote the owner or owners of the data (and ideally also its provenance), and an ontology to provide semantics or meaning to the data. Given these, we can construct a privacy model in which the flow of data through the system can be monitored, and boundaries can be established where control points for the data can be placed.

Figure 2. Privacy, overview

VI. CONCLUSIONS

As stated above, the rapidly shifting technology environment raises serious questions for executives about how to help their companies capitalize on the transformation that is underway. Three trends (anything as a service, multisided business models, and innovation from the bottom of the pyramid) augur far-reaching changes in the business environment and require radical shifts in strategy. Fine-grained (and thus easily accountable) software frameworks allow customers to monitor, measure, customize, and bill for asset use at a much finer-grained level than before. B2B customers in particular would benefit, since such granular services would allow them to purchase fractions of a service and to account for them as a variable cost, rather than as capital investments with the corresponding overhead.

Therefore, the driving agenda above can be seen as the creation of tools for the next developer and consumer applications, computing platform and backend infrastructure, backed by application technologies to enable a qualitative leap in service offerings, from the perspectives of scalable technology and business models with high elasticity, supporting future strategic growth areas and maintenance requirements.

REFERENCES

[1] G. Kiczales, “Aspect-Oriented Programming,” ACM Computing Surveys, vol. 28, no. 4, 1996.

[2] P. Buneman and W.-C. Tan, “Provenance in databases,” in SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM, 2007, pp. 1171–1173.

[3] H. Ossher and P. Tarr, “Multi-Dimensional Separation of Concerns in Hyperspace,” in Proceedings of the Aspect-Oriented Programming Workshop at ECOOP’99, C. Lopes, L. Bergmans, A. Black, and L. Kendall, Eds., 1999, pp. 80–83.

[4] O. Lassila, M. Mannermaa, I. Oliver, and M. Sabbouh, “Aspect-Oriented Data,” accepted submission to the W3C Workshop on RDF Next Steps, World Wide Web Consortium, June 2010.

[5] M. A. Rodriguez, “The RDF virtual machine,” Knowledge-Based Systems, vol. 24, no. 6, August 2011, pp. 890–903.

[6] U. Reddy, “Objects as closures: abstract semantics of object-oriented languages,” in Proceedings of the 1988 ACM Conference on LISP and Functional Programming (LFP ’88), ACM, New York, NY, USA, 1988, pp. 289–297.

[7] QtClosure project, https://gitorious.org/qtclosure, retrieved Oct. 2011.

[8] SlipStream project, https://github.com/verdyr/SlipStream, retrieved Oct. 2011.

[9] Cisco Systems, Inc., “Design Best Practices for Latency Optimization,” White Paper.

[10] S. Souders, High Performance Web Sites, O’Reilly, 2007, ISBN 0-596-52930-9.


[11] The Psychology of Web Performance, http://www.websiteoptimization.com/speed/tweak/psychology-web-performance, retrieved Jun. 2011.

[12] L. Kleinrock, Queueing Systems, vol. II: Computer Applications, Wiley Interscience, 1976. (Published in Russian, 1979; published in Japanese, 1979.)

[13] McKinsey Column, “Clouds, big data, and smart assets,” Financial Times, September 22, 2010.