
PARALLEL DATA LABORATORY
CARNEGIE MELLON UNIVERSITY

AN INFORMAL PUBLICATION FROM ACADEMIA'S PREMIERE STORAGE SYSTEMS RESEARCH CENTER, DEVOTED TO ADVANCING THE STATE OF THE ART IN STORAGE SYSTEMS AND INFORMATION INFRASTRUCTURES.

THE PDL PACKET

THE NEWSLETTER ON PDL ACTIVITIES AND EVENTS • FALL 2003
http://www.pdl.cmu.edu/

PDL CONSORTIUM MEMBERS

EMC Corporation

Hewlett-Packard Labs

Hitachi, Ltd.

IBM Corporation

Intel Corporation

Microsoft Corporation

Network Appliance

Panasas, Inc.

Oracle Corporation

Seagate Technology

Sun Microsystems

Veritas Software Corporation

CONTENTS

Automating Storage Management ...... 1

Director’s Letter ................................... 2

New PDL Faculty ................................. 3

Year in Review ..................................... 4

Recent Publications ............................. 5

Database I/O Optimization .................. 8

PDL News .......................................... 10

Proposals & Defenses ....................... 13

Toward Better File Searching ............. 14

Self-* Storage

Greg Ganger & Joan Digney

Human administration of storage systems is a large and growing issue in modern IT infrastructures. We are exploring new storage architectures that integrate automated management functions and simplify the human administrative task. Self-* storage systems (pronounced "self-star," a play on the unix shell wild-card character) are self-configuring, self-organizing, self-tuning, self-healing, self-managing systems of storage bricks. Borrowing organizational ideas from corporate structure and technologies from AI and control systems, self-* storage should simplify storage administration, reduce system cost, increase system robustness, and simplify system construction.

The Self-* Storage Architecture

Dramatic simplification of storage administration requires that associated functionalities be designed into the storage system from the start and integrated throughout the design. System components must continually collect information about ongoing tasks, and regular evaluation of configuration and workload partitioning must occur, all within the context of high-level administrator guidance. The self-* storage project is designing an architecture from a clean slate to explore such integration. The high-level system architecture is shown in Figure 1.

The Brick-based Storage Infrastructure. We envision self-* storage systems composed of networked "intelligent" storage bricks, each consisting of CPU(s), RAM, and a number of disks; example brick designs provide 0.5-5 TB of storage with moderate levels of reliability, availability, and performance. Each storage brick (a "worker") self-tunes and adapts its operation to its observed workload and assigned goals. Data redundancy across and within storage bricks provides fault tolerance and creates opportunities for automated reconfiguration to handle many problems. Out-of-band supervisory processes assign datasets and goals to workers, track

Continued on page 12


Figure 1: Architecture of self-* storage. At the top of the diagram is the management hierarchy, concerned with the distribution of goals and the delegation of storage responsibilities from the system administrator down to the individual worker devices. The path of I/O requests from clients, through routing nodes, to the workers for service is shown at the bottom. Note that the management infrastructure is logically independent of the I/O request path, and that the routing is logically independent of the clients and the workers.

THE PDL PACKET

The Parallel Data Laboratory

School of Computer Science

Department of ECE

Carnegie Mellon University

5000 Forbes Avenue

Pittsburgh, PA 15213-3891

VOICE 412•268•6716

FAX 412•268•3010

PUBLISHER

Greg Ganger

EDITOR

Joan Digney

The PDL Packet is published once per year and provided to members of the PDL Consortium. Copies are given to other researchers in industry and academia as well. A pdf version resides in the Publications section of the PDL Web pages and may be freely distributed. Contributions are welcome.

COVER ILLUSTRATION

Skibo Castle and the lands that comprise its estate are located in the Kyle of Sutherland in the northeastern part of Scotland. Both 'Skibo' and 'Sutherland' are names whose roots are from Old Norse, the language spoken by the Vikings who began washing ashore regularly in the late ninth century. The word 'Skibo' fascinates etymologists, who are unable to agree on its original meaning. All agree that 'bo' is the Old Norse for 'land' or 'place.' But they argue whether 'ski' means 'ships' or 'peace' or 'fairy hill.'

Although the earliest version of Skibo seems to be lost in the mists of time, it was most likely some kind of fortified building erected by the Norsemen. The present-day castle was built by a bishop of the Roman Catholic Church. Andrew Carnegie, after making his fortune, bought it in 1898 to serve as his summer home. In 1980, his daughter, Margaret, donated Skibo to a trust that later sold the estate. It is presently being run as a luxury hotel.


FROM THE DIRECTOR’S CHAIR

Greg Ganger

Hello from fabulous Pittsburgh!

2003 has been an exciting year in the Parallel Data Lab, with a number of projects maturing and contributing to a broad, new research initiative in automated storage administration. Along the way, two faculty and several new students joined the Lab, several students spent summers with PDL Consortium companies, and several students won new Fellowships (one each from Intel, NSF, and DoD).

The PDL continues to pursue a broad array of storage systems research, from the underlying devices to the applications that rely on storage. The past year brought excellent progress in existing projects and the initiation of exciting new ones. Let me highlight a few things.

The biggest development has been the initiation of a major new project called Self-* Storage, which explores the design and implementation of self-organizing, self-configuring, self-tuning, self-healing, self-managing systems of storage bricks. For years, PDL Retreat attendees have been pushing us to attack "storage management of large installations," and this project is our response. With generous equipment donations from the PDL Consortium companies, we hope to put together and maintain 100s of terabytes of storage to be used by ourselves and other CMU researchers (e.g., in data mining, astronomy, and scientific visualization). Of course, the research challenge is to make the system almost entirely self-*, so as to avoid the traditional costs of storage administration; deploying a real system will allow us to test our ideas in practice. Towards this end, we are designing the Self-* Storage architecture with a clean slate, integrating management functions throughout. In realizing the design, we are combining a number of recent and ongoing PDL projects (e.g., freeblock scheduling, PASIS, and self-securing storage) and ramping up efforts on new challenges such as black-box device and workload modeling, automated decision making, and automated diagnosis.

PDL's push into database storage management has also produced an exciting new project: Fates. A Fates-based database storage manager transparently exploits select knowledge of the underlying storage infrastructure to automatically achieve robust, tuned performance. As in Greek mythology, there are three Fates: Atropos, Clotho, and Lachesis. The Atropos volume manager stripes data across disks based on track boundaries and exposes aspects of the resulting parallelism. The Lachesis storage manager utilizes track boundary and parallelism information to match database structures and access patterns to the underlying storage. The Clotho dynamic page layout allows retrieval of just the desired table attributes, eliminating unnecessary I/O and wasted main memory. Altogether, the three Fates components simplify database administration, increase performance, and avoid performance fluctuations due to query interference. Other database systems projects are exploring benchmarking techniques, automatic physical database design, and new software architectures for robust load management.

The self-securing devices project has made great strides. Highlighted in 2002 by several news organizations, this project adapts medieval warfare notions to the defense of networked computing infrastructures. In a nutshell, devices are augmented with relevant security functionality and made intrusion-independent from client OSes and other devices. This architecture makes systems more intrusion-tolerant and more manageable when under attack. The self-securing devices vision has brought with it many interesting challenges and a healthy


CONTACT US

WEB PAGES
PDL Home: http://www.pdl.cmu.edu/
Please see our web pages at http://www.pdl.cmu.edu/PEOPLE/ for further contact information.

FACULTY

Greg Ganger (director)
412•268•1297
Anastassia Ailamaki
Anthony Brockwell
Christos Faloutsos
Garth Gibson
Seth Goldstein
Mor Harchol-Balter
Chris Long
Todd Mowry
Adrian Perrig
Mike Reiter
M. Satyanarayanan
Srini Seshan
Dawn Song
Chenxi Wang
Hui Zhang

STAFF MEMBERS

Karen Lindenfelser
(pdl business administrator)
412•268•6716
Stan Bielski
Mike Bigrigg
John Bucy
Joan Digney
Gregg Economou
Ken Tew
Linda Whipkey

STUDENTS

Michael Abd-El-Malek
Mukesh Agrawal
Kinman Au
Shimin Chen
Garth Goodson
John Linwood Griffin
Stavros Harizopoulos
James Hendricks
Andrew Klosterman
Chris Lumb
Amit Manjhi
Michael Mesnier
Jim Newsome
Spiros Papadimitriou
Stratos Papadomanolakis
Adam Pennington
Ginger Perng
David Petrou
Brandon Salmon
Raja Sambasivan
Jiri Schindler
Steve Schlosser
Minglong Shao
Vlad Shkapenyuk
Shafeeq Sinnamohideen
Craig Soules
John Strunk
Eno Thereska
Niraj Tolia
Mengzhi Wang
Ted Wong
Jay Wylie
Shuheng Zhou


FROM THE DIRECTOR'S CHAIR

source of funding. We have continued our exploration of how to build and exploit network interface software for containing compromised client systems, and explored in depth a new way of detecting intruders: storage-based intrusion detection. Storage devices are uniquely positioned to spot some common intruder actions (e.g., scrubbing audit logs and inserting backdoors), making this an exciting new concept.

Other ongoing PDL projects are also producing cool results. For example, the PASIS project has produced a family of efficient and scalable consistency protocols that support a wide range of fault models and require no change to the server infrastructure. A new project has begun exploring the use of context information for assigning attributes to files to enable effective attribute-based searches. We have built a working freeblock scheduling system and API in FreeBSD, and are using it for research activities such as continuous disk reorganization; it will also be a core component of each self-* storage brick. Decentralized caching is being explored in several contexts, including wide-area systems that opportunistically utilize CAS overlays and NFS clusters that dynamically shed load by replicating read-only files. This newsletter and the PDL website offer more details and additional research highlights.

On the education front: This Spring, for the third time, we offered our storage systems course to undergraduates and masters students at Carnegie Mellon. Topics span the design, implementation, and use of storage systems, from the characteristics and operation of individual storage devices to the OS, database, and networking techniques involved in tying them together and making them useful. The base lectures were complemented by real-world expertise generously shared by 8 guest speakers from industry, including several members of the SNIA Technical Council. We continue to work on the book, and several other schools have already picked up and started teaching similar storage systems courses. We view providing storage systems education as critical to the field's future; stay tuned.

I'm always overwhelmed by the accomplishments of the PDL students and staff, and it's a pleasure to work with them. As always, their accomplishments point at great things to come.

NEW PDL FACULTY

Anthony Brockwell

Dr. Anthony Brockwell, Assistant Professor in the Dept. of Statistics at Carnegie Mellon, received his Ph.D. in Statistics and Electrical Engineering from the University of Melbourne in 1998. He joined the PDL in June to collaborate on our Self-* Storage project. Dr. Brockwell's research interests are primarily in the areas of dynamical systems and Bayesian computational methods. He is particularly interested in the analysis and control of nonlinear and non-Gaussian systems, and in associated computational methods such as particle filtering and Markov chain Monte Carlo. Dr. Brockwell has published articles in such journals as SIAM Journal on Control and Optimization, Journal of Time Series Analysis, and Journal of Computational and Graphical Statistics.


Mahadev Satyanarayanan

At long last, Satya has joined the PDL! Long renowned as a leading researcher in distributed file systems, Satya's research has given us many of the principles on which today's and tomorrow's file systems are based.

Key ideas from the Coda file system, which supports disconnected and bandwidth-adaptive operation, have been incorporated by Microsoft into the IntelliMirror component of Windows. Another outcome of his research is Odyssey, a set of open-source operating system extensions for enabling mobile applications to adapt to variation in critical resources such as bandwidth and energy. Coda and Odyssey are building blocks in Project Aura, a research initiative at Carnegie Mellon to build a distraction-free ubiquitous computing environment. Earlier, Satyanarayanan was a principal architect and implementor of the Andrew File System (AFS), which has been commercialized by IBM.

Dr. Satyanarayanan is the Carnegie Group Professor of Computer Science at Carnegie Mellon University. He is currently serving as the founding director of Intel Research Pittsburgh, which focuses on software systems for distributed data storage. He is the founding Editor-in-Chief of IEEE Pervasive Computing. He received his Ph.D. in Computer Science from Carnegie Mellon, after completing Bachelor's and Master's degrees from the Indian Institute of Technology, Madras. He is also a Fellow of the ACM and the IEEE.

YEAR IN REVIEW

October 2003
11th Annual PDL Retreat and Workshop.

September 2003
Jiri Schindler presented two papers at VLDB in Berlin: "Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics," and "Matching Database Access Patterns to Storage Characteristics," which won the Ph.D. Workshop's Best Paper Award.
Christos Faloutsos gave the keynote talk "Next Generation Data Mining Tools: Power laws and self-similarity for graphs, streams and traditional data" at ECML, Dubrovnik, Croatia.
Spiros Papadimitriou presented "Adaptive, Hands-Off Stream Mining" at VLDB.

August 2003
Greg visited Intel in Portland, OR, to present Self-* Storage.
Adam Pennington presented "Storage-based Intrusion Detection" at USENIX Security '03 in Washington, DC.
Chris Long organized a Birds of a Feather session on "Security and Usability" at the USENIX Security Symposium in Washington, DC.
Christos Faloutsos gave a tutorial on "Mining Time Series Data" at ICML 2003, Washington DC.
Jiri Schindler successfully defended his PhD dissertation titled "Matching Application Access Patterns to Storage Device Characteristics."

July 2003
Greg visited HP, Microsoft and Veritas in CA and presented Self-* Storage.

June 2003
Brandon Salmon presented "A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts" at the AASMS Workshop in San Diego, CA. Greg also attended.
Christos Faloutsos presented a tutorial on "Internet Research meets Data Mining: Current Knowledge and New Tools" at SIGMETRICS 2003, San Diego.
Niraj Tolia attended the Usenix Annual Technical Conference in San Antonio, TX and presented the paper "Opportunistic Use of Content Addressable Storage for Distributed File Systems."

May 2003
Craig Soules spoke on "Why Can't I Find My Files? New Methods for Automating Attribute Assignment" at HotOS in Lihue, HI. Greg also attended.
Greg attended the AFOSR program review in Colorado and presented Self-Securing Devices.
Adam Pennington spent the summer interning at Seagate in Pittsburgh; Brandon Salmon interned at Microsoft Research, and Craig Soules interned with HP Labs.

April 2003
Craig Soules presented "Metadata Efficiency in Versioning File Systems" at FAST 03; Greg and many other PDL students also attended.
Fifth annual PDL Industry Visit Day.
Chris Long co-organized the Workshop on Human-Computer

Continued on page 19


RECENT PUBLICATIONS

Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics

Schindler, Ailamaki & Ganger

VLDB 03, Berlin, Germany, Sept 9-12, 2003.

Database systems work hard to tune I/O performance, but do not always achieve the full performance potential of modern disk systems. Their abstracted view of storage components hides useful device-specific characteristics, such as disk track boundaries and advanced built-in firmware algorithms. This paper presents a new storage manager architecture, called Lachesis, that exploits and adapts to observable device-specific characteristics in order to achieve and sustain high performance. For DSS queries, Lachesis achieves I/O efficiency nearly equivalent to sequential streaming even in the presence of competing random I/O traffic. In addition, Lachesis simplifies manual configuration and restores the optimizer's assumptions about the relative costs of different access patterns expressed in query plans. Experiments using IBM DB2 I/O traces as well as a prototype implementation show that Lachesis improves standalone DSS performance by 10% on average. More importantly, when running concurrently with an on-line transaction processing (OLTP) workload, Lachesis improves DSS performance by up to 3X, while OLTP also exhibits a 7% speedup.

A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts

Salmon, Thereska, Soules & Ganger

First Workshop on Algorithms and Architectures for Self-Managing Systems. In conjunction with Federated Computing Research Conference (FCRC). San Diego, CA. June 11, 2003.

Many heuristics have been developed for adapting on-disk data layouts to expected and observed workload characteristics. This paper describes a two-tiered software architecture for cleanly and extensibly combining such heuristics. In this architecture, each heuristic is implemented independently and an adaptive combiner merges their suggestions based on how well they work in the given environment. The result is a simpler and more robust system for automated tuning of disk layouts, and a useful blueprint for other complex tuning problems such as cache management, scheduling, data migration, and so forth.
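As a rough illustration of the two-tier structure (this is not the paper's implementation; the heuristic names, scoring rule, and feedback signal below are all invented), a combiner might track a trust weight per heuristic and adjust it from observed benefit:

```python
# Hypothetical sketch of an adaptive combiner: each independently implemented
# heuristic proposes a layout, and the combiner trusts the heuristic whose
# past suggestions have worked best in this environment.
class Combiner:
    def __init__(self, heuristics):
        self.weights = {h: 1.0 for h in heuristics}   # start with equal trust

    def choose(self, suggestions):
        """suggestions: {heuristic_name: proposed_layout}; pick by weight."""
        best = max(suggestions, key=lambda h: self.weights[h])
        return suggestions[best], best

    def feedback(self, heuristic, observed_benefit):
        # Exponentially smooth the measured benefit into the heuristic's weight.
        self.weights[heuristic] = (0.8 * self.weights[heuristic]
                                   + 0.2 * observed_benefit)

combiner = Combiner(["track-align", "hot-data-first"])
layout, source = combiner.choose(
    {"track-align": "layout-A", "hot-data-first": "layout-B"})
combiner.feedback(source, observed_benefit=1.4)   # e.g., a measured speedup
print(layout, source, combiner.weights)
```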

Efficient Consistency for Erasure-coded Data via Versioning Servers

Goodson, Wylie, Ganger & Reiter

Carnegie Mellon University Technical Report CMU-CS-03-127, April 2003.

This paper describes the design, implementation and performance of a family of protocols for survivable, decentralized data storage. These protocols exploit storage-node versioning to efficiently achieve strong consistency semantics. These protocols allow erasure-codes to be used that achieve network and storage efficiency (and optionally data confidentiality in the face of server compromise). The protocol family is general in that its parameters accommodate a wide range of fault and timing assumptions, up to asynchrony and Byzantine faults of both storage-nodes and clients, with no changes to server implementation or client-server interface. Measurements of a prototype storage system using these protocols show that the protocol performs well under various system model assumptions, numbers of failures tolerated, and degrees of reader-writer concurrency.

A Human Organization Analogy for Self-* Systems

Strunk & Ganger

First Workshop on Algorithms and Architectures for Self-Managing Systems. In conjunction with Federated Computing Research Conference (FCRC). San Diego, CA. June 11, 2003.

The structure and operation of human organizations, such as corporations, offer useful insights to designers of self-* systems (a.k.a. self-managing or autonomic). Examples include worker/supervisor hierarchies, avoidance of micro-management, and complaint-based tuning. This paper explores the analogy, and describes the design of a self-* storage system that borrows from it.

Exposing and Exploiting Internal Parallelism in MEMS-based Storage

Schlosser, Schindler, Ailamaki & Ganger

Carnegie Mellon University Technical Report CMU-CS-03-125, March 2003.

MEMS-based storage has interesting access parallelism features. Specifically, subsets of a MEMStore's thousands of tips can be used in parallel, and the particular subset can be dynamically chosen. This paper describes how such access parallelism can be exposed to system software, with minimal changes to system interfaces,

Query optimization and execution in a typical DBMS.

Continued on page 6



RECENT PUBLICATIONS

Continued from page 5

and utilized cleanly for two classes of applications. First, background tasks can utilize unused parallelism to access media locations with no impact on foreground activity. Second, two-dimensional data structures, such as dense matrices and relational database tables, can be accessed in both row order and column order with maximum efficiency. With proper table layout, unwanted portions of a table can be skipped while scanning at full speed. Using simulation, we explore performance features of using this device parallelism for an example application from each class.

Data Page Layouts for Relational Databases on Deep Memory Hierarchies

Ailamaki, DeWitt & Hill

The VLDB Journal 11(3), 2002.

Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results (which were obtained without using any indices on the participating relations), when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster. Finally, we show that PAX performs well across different memory system designs.
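To make the layout difference concrete, here is a toy sketch of the contrast between NSM-style and PAX-style pages; the record values and attribute names are made up, and real pages are of course byte-level structures rather than Python lists:

```python
# Toy sketch of the PAX idea: within one page, values are grouped into one
# minipage per attribute instead of being stored record-by-record (NSM),
# so a scan of a single attribute touches contiguous, cache-friendly data.
records = [(1, "alice", 34.0), (2, "bob", 27.5), (3, "carol", 41.2)]
attributes = ["id", "name", "score"]

def nsm_page(recs):
    """Slotted-page style: whole records stored one after another."""
    return list(recs)

def pax_page(recs, attrs):
    """PAX style: one minipage (list) per attribute within the same page."""
    return {attr: [rec[i] for rec in recs] for i, attr in enumerate(attrs)}

print(nsm_page(records))            # record-oriented layout
page = pax_page(records, attributes)
print(page["score"])                # a single-attribute scan reads one minipage
```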

Self-* Storage: Brick-based Storage with Automated Administration

Ganger, Strunk & Klosterman

Carnegie Mellon University Technical Report CMU-CS-03-178, August 2003.

This white paper describes a new project exploring the design and implementation of "self-* storage systems:" self-organizing, self-configuring, self-tuning, self-healing, self-managing systems of storage bricks. Borrowing organizational ideas from corporate structure and automation technologies from AI and control systems, we hope to dramatically reduce the administrative burden currently faced by data center administrators. Further, compositions of lower cost components can be utilized, with available resources collectively used to achieve high levels of reliability, availability, and performance.

Finding and Containing Enemies Within the Walls with Self-securing Network Interfaces

Ganger, Economou & Bielski

Carnegie Mellon University Technical Report CMU-CS-03-109, January 2003.

Self-securing network interfaces (NIs) examine the packets that they move between network links and host software, looking for and potentially blocking malicious network activity. This paper describes how self-securing network interfaces can help administrators to identify and contain compromised machines within their intranet. By shadowing host state, self-securing NIs can better identify suspicious traffic originating from that host, including many explicitly designed to defeat network intrusion detection systems. With normalization and detection-triggered throttling, self-securing NIs can reduce the ability of compromised hosts to launch attacks on other systems inside (or outside) the intranet. We describe a prototype self-securing NI and example scanners for detecting such things as TTL abuse, fragmentation abuse, "SYN bomb" attacks, and random-propagation worms like Code-Red.

Object-Based Storage

Mesnier, Ganger & Riedel

IEEE Communications Magazine, vol. 41, no. 8, pp. 84-90, August 2003.

Storage technology has enjoyed considerable growth since the first disk drive was introduced nearly 50 years ago, in part facilitated by the slow and steady evolution of storage interfaces

Self-securing network interfaces. (A) shows the conventional network security configuration, wherein a firewall and a NIDS protect LAN systems from some WAN attacks. (B) shows the addition of self-securing NIs, one for each LAN system.

Continued on page 7


RECENT PUBLICATIONS

Continued from page 6

(SCSI and ATA/IDE). The stability of these interfaces has allowed continual advances in both storage devices and applications, without frequent changes to the standards. However, the interface ultimately determines the functionality supported by the devices, and current interfaces are holding system designers back. Storage technology has progressed to the point that a change in the device interface is needed. Object-based storage is an emerging standard designed to address this problem. In this article we describe object-based storage, stressing how it improves data sharing, security, and device intelligence. We also discuss some industry applications of object-based storage and academic research using objects as a foundation for building even more intelligent storage systems.

Why Can't I Find My Files? New Methods for Automating Attribute Assignment

Soules & Ganger

Proceedings of the Ninth Workshop on Hot Topics in Operating Systems, USENIX Association, May 2003.

This paper analyzes various algorithms for scheduling low priority disk drive tasks. The derived closed form solution is applicable to a class of greedy algorithms that includes a variety of background disk scanning applications. By paying close attention to many characteristics of modern disk drives, the analytical solutions achieve very high accuracy; the difference between the predicted response times and the measurements on two different disks is only 3% for all but one examined workload. This paper also proves a theorem which shows that background tasks implemented by greedy algorithms can be accomplished with very little seek penalty. Using greedy algorithms gives a 10% shorter response time for the foreground application requests and up to a 20% decrease in total background task run time compared to results from previously published techniques.

Storage-based Intrusion Detection: Watching Storage Activity For Suspicious Behavior

Pennington, Strunk, Griffin, Soules, Goodson & Ganger

12th USENIX Security Symposium, Washington, D.C., Aug 4-8, 2003. An early version is available as Carnegie Mellon University Technical Report CMU-CS-02-179, September 2002.

Storage-based intrusion detection allows storage systems to watch for data modifications characteristic of system intrusions. This enables storage systems to spot several common intruder actions, such as adding backdoors, inserting Trojan horses, and tampering with audit logs. Further, an intrusion detection system (IDS) embedded in a storage device continues to operate even after client systems are compromised. This paper describes a number of specific warning signs visible at the storage interface. Examination of 18 real intrusion tools reveals that most (15) can be detected based on their changes to stored files. We describe and evaluate a prototype storage IDS, embedded in an NFS server, to demonstrate both feasibility and efficiency of storage-based intrusion detection. In particular, both the performance overhead and memory required (152 KB for 4730 rules) are minimal.
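As a loose illustration of the idea (the rule format below is invented, not the prototype's actual rule language), a storage-embedded detector can check each modification request against rules about files that legitimate clients should rarely or never change:

```python
# Hypothetical sketch of storage-level rule matching: the storage server sees
# every file modification and flags operations that match "never do this"
# rules, independently of whether the client OS has been compromised.
RULES = [
    {"path": "/var/log/audit.log", "deny": {"truncate", "overwrite"}},
    {"path": "/etc/passwd",        "deny": {"delete"}},
    {"path": "/bin/login",         "deny": {"overwrite"}},   # backdoor swap
]

def check_modification(path, operation):
    """Return an alert string if the modification looks suspicious."""
    for rule in RULES:
        if path == rule["path"] and operation in rule["deny"]:
            return f"ALERT: {operation} on {path}"
    return None

print(check_modification("/var/log/audit.log", "truncate"))  # alert
print(check_modification("/home/u/notes.txt", "overwrite"))  # None
```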

A Case for Staged Database Systems

Harizopoulos & Ailamaki

In Proceedings of the First International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.

Traditional database system architectures face a rapidly evolving operating environment, where millions of users store and access terabytes of data. In order to cope with increasing demands for performance, high-end DBMS employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted, work-centric, thread-parallel execution model entails several shortcomings that limit server performance when executing workloads with changing requirements. Moreover, the monolithic approach in DBMS software has led to complex and difficult-to-extend designs. This paper introduces a staged design for high-performance, evolvable DBMS that are easy to tune and maintain. We propose to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMS by providing solutions at both a hardware and a software engineering level.
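A minimal sketch of the staged, queue-connected structure the abstract describes follows; the stage names, the toy work functions, and the threading details are illustrative only, not the paper's design:

```python
# Sketch of a staged server: each stage is self-contained, owns a queue and a
# worker thread, and hands its output to the next stage's queue.
import queue
import threading
import time

class Stage:
    def __init__(self, name, work_fn, downstream=None):
        self.name = name
        self.work_fn = work_fn        # per-stage processing step
        self.downstream = downstream  # next stage, or None for the last one
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        self.inbox.put(item)

    def _run(self):
        while True:
            item = self.inbox.get()
            result = self.work_fn(item)
            if self.downstream is not None:
                self.downstream.submit(result)

# A toy three-stage pipeline: parse -> optimize -> execute.
execute = Stage("execute", lambda q: print("result of:", q))
optimize = Stage("optimize", lambda q: q + " [optimized]", downstream=execute)
parse = Stage("parse", lambda q: q.strip(), downstream=optimize)

parse.submit("SELECT * FROM R")
time.sleep(0.2)   # give the daemon threads time to drain the queues
```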

Adaptive, Hands-Off Stream Mining

Papadimitriou, Brockwell & Faloutsos

Carnegie Mellon University SCS Technical Report CMU-CS-02-205. Also published in Proceedings of VLDB 03, Berlin, Germany, Sept 9-12, 2003.

Sensor devices and embedded processors are becoming ubiquitous, especially in measurement and monitoring applications. Automatic discovery of patterns and trends in the large volumes of such data is of paramount importance. The combination of rela-

Continued on page 16

A production-line model for staged servers. (λi: time to load module i; μi: mean service time when module i is already loaded.)


Figure 2: Operations performed by each component of the database system, showing the content of a single memory frame at each stage of a data request during query execution.

DATABASE QUERY EXECUTION WITH FATES

Current database systems use data layouts that can exploit unique features of only one level of the memory hierarchy (cache/main memory or on-line storage). Such layouts optimize for the predominant access pattern of one workload (e.g., DSS), while trading off performance of another workload type (e.g., OLTP). Achieving efficient execution of different workloads without this trade-off or the need to manually re-tune the system for each workload type is still an unsolved problem. The "Fates" database system project answers this challenge.

The primary goal of the project is to achieve efficient execution of compound database workloads at all levels of a database system memory hierarchy. By leveraging unique characteristics of devices at each level, the Fates database system can automatically match query access patterns to the respective device characteristics, which eliminates the difficult and error-prone task of manual performance tuning.

Borrowing from the Greek mythology of The Three Fates (Clotho, Lachesis, and Atropos), who spin, measure, and cut the thread of life, the three components of our database system (bearing the Fates' respective names) establish proper abstractions in the database query execution engine. These abstractions cleanly separate the functionality of each component (described below) while allowing efficient query execution along the entire path through the database system.

The main feature of the Fates database system is the decoupling of the in-memory data layout from the on-disk storage layout. Different data layouts at each memory hierarchy level can be tailored to leverage specific device characteristics at that level. An in-memory page layout can leverage L1/L2 cache characteristics, while another layout can leverage characteristics of storage devices. Additionally, the storage-device-specific layout also yields efficient I/O accesses to records for different query access patterns. Finally, this decoupling also provides flexibility in determining what data to request and keep around in main memory.

Traditional database systems are forced to fetch and store unnecessary data as an artifact of a chosen data layout. The Fates database system, on the other hand, can request, retrieve, and store just the needed data, catering to the needs of a specific query. This conserves storage device bandwidth and memory capacity, and avoids cache pollution, all of which improves query execution time.

Scatter/gather I/O facilitates efficient transformation from one layout into another and creates an organization amenable to the individual query needs on-the-fly. In addition to eliminating expensive data copies, these I/Os match explicit storage device characteristics, ensuring efficient execution at the storage device. Finally, thanks to proper abstractions established by each Fate, other database system components (e.g., the query optimizer or the locking manager) remain unchanged.

CLOTHO

Clotho ensures efficient query execution at the cache/main-memory level and figures at the inception of a request for particular data. When the data desired by a query is not found in the buffer pool, Clotho creates a skeleton of a page in a memory frame describing what data is needed and how to lay it out.

The page layout builds upon a cache-friendly layout, called PAX [1], which groups data into minipages, where each minipage contains the data of only a single attribute (table column). Aligning records on cache line boundaries within each minipage and taking advantage of cache prefetching logic improves the performance of scan operations without sacrificing full-record access.

Clotho adjusts the position and size of each minipage within the frame. It matches the minipage size to the (multiple of) block size of the storage device, provided by Atropos, and decides which minipages to store within a single page based on the needs of a given query. Hence, minipages within a single memory frame contain only the attributes needed by that query. The remaining attributes constituting the full record, but not needed by the query, are never requested.
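A small sketch of what such a skeleton page header might look like appears below; the data structures, block size, and schema are hypothetical, intended only to show how only the requested attributes get minipages sized to a multiple of the device block size:

```python
# Hypothetical Clotho-style skeleton page header: it names the attributes a
# query needs and the record range to fetch, and reserves one block-aligned
# minipage per requested attribute within the memory frame.
BLOCK_SIZE = 512          # assumed device block size (bytes)

def build_skeleton_page(schema, wanted_attrs, first_rec, last_rec):
    """Return a page header describing which minipages to fill."""
    header = {"records": (first_rec, last_rec), "minipages": []}
    offset = 0
    for attr in wanted_attrs:
        width = schema[attr]                        # bytes per value
        raw = width * (last_rec - first_rec + 1)    # bytes actually needed
        size = -(-raw // BLOCK_SIZE) * BLOCK_SIZE   # round up to block multiple
        header["minipages"].append(
            {"attr": attr, "offset": offset, "size": size})
        offset += size
    return header

# Example: a 12-attribute table, but the query touches only A1 and A4.
schema = {f"A{i}": 8 for i in range(1, 13)}
print(build_skeleton_page(schema, ["A1", "A4"], first_rec=0, last_rec=99))
```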

The skeleton page inside the memory frame includes a header that lists the attributes and the range of records to be retrieved from the storage device and

Continued on page 9

Figure 1: The Fates database system architecture. Efficient transformation of on-disk layout to in-memory page layout is achieved by DMA and scatter/gather I/O.

Jiri Schindler, Minglong Shao, Steve Schlosser, Anastassia Ailamaki & Greg Ganger



DATABASE QUERY EXECUTION WITH FATES

Continued from page 8

stored in that frame. Clotho marks the set of attributes needed by the query in the page header (i.e., the payload) and identifies the range of records to be put into the page. Hence, the page header serves as a request from Clotho to Lachesis to retrieve the desired data.

LACHESIS

The Lachesis database storage manager handles the mapping and access to minipages located within the LBNs of on-line storage devices. Utilizing storage device-provided performance characteristics and matching query access patterns (e.g., sequential scan) to these hints, Lachesis constructs efficient I/Os. It also sets scatter/gather I/O vectors that allow direct placement of individual minipages to proper memory frames without unnecessary memory copies.

Explicit relationships between individual logical blocks (LBNs), established by Atropos, allow Lachesis to devise a layout that groups together a set of related minipages. This grouping ensures that all attributes belonging to the same set of records can be accessed in parallel, while a particular attribute can be accessed with efficient sequential I/Os. The relationships between LBNs serve as hints that let Lachesis construct I/Os that Atropos can execute efficiently.

The layout and content of each minipage is transparent to both Lachesis and Atropos. Lachesis merely decides how to map each minipage to LBNs to be able to construct a batch of efficient I/Os. Atropos, in turn, cuts these batches into individual disk I/Os comprising the exported logical volume. It does not care where the data will be placed in memory; this is decided by the scatter/gather I/O vectors set up by Lachesis.
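The following sketch illustrates the scatter/gather idea in miniature; the LBN numbers, lengths, and frame offsets are invented, and a real storage manager would issue the batch through the storage interface rather than just build a list:

```python
# Hypothetical sketch: for each minipage, the storage manager knows the LBN
# where it lives and the memory offset where the page layout wants it placed,
# so one batched I/O can drop every minipage directly into the frame.
def build_sg_vector(minipage_map, frame_base):
    """minipage_map: list of (lbn, length, frame_offset) per minipage."""
    io_vector = []
    for lbn, length, frame_offset in minipage_map:
        io_vector.append({
            "lbn": lbn,                          # where the minipage is on disk
            "length": length,                    # bytes to transfer
            "dest": frame_base + frame_offset,   # where it lands in memory
        })
    # Sort by LBN so the batch can be issued as one efficient request chain.
    return sorted(io_vector, key=lambda e: e["lbn"])

sg = build_sg_vector([(120, 512, 1024), (48, 512, 0), (84, 512, 512)],
                     frame_base=0x10000)
for entry in sg:
    print(entry)
```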

ATROPOS

Atropos is a disk array logical volumemanager that offers efficient access inboth row- and column-major orders. Fordatabase systems, this translates into ef-ficient access to complete records aswell as scans of an arbitrary number of

Continued from page 8

table attributes. By utilizing features builtinto disk firmware and a new data lay-out, Atropos delivers the aggregatebandwidth of all disks for accesses inboth majors, without penalizing smallrandom I/O accesses.

The basic allocation unit in Atropos isthe quadrangle, which is a collection oflogical volume LBNs. A quadranglespans the entire track of a single diskalong one dimension and a small num-ber of adjacent tracks along the otherdimension. Each successive quadrangleis mapped to a different disk, much likea stripe unit of an ordinary RAID group.Hence, the RAID 1 or RAID 5 data pro-tection schemes fit the quadrangle lay-out naturally.

Atropos stripes contiguous LBNsacross quadrangles mapped to all disks.This provides aggregate streamingbandwidth of all disks for table accessesin column-major order (e.g., for singleattribute scans). With quadrangle“width” matching disk track size, se-quential accesses exploits the high effi-ciency of track-based access [2].

Accesses in the other major order (i.e.row-major order), called semi-sequen-tial, proceed to LBNs mapped diago-nally across a quadrangle, with each

LBN on a different track. Issuing allrequests together to these LBNs allowsthe disk’s internal scheduler to servicethe request with the smallest position-ing cost first, given the current disk headposition. Servicing the remaining re-quests does not incur any other posi-tioning overhead thanks to the diago-nal layout. Hence, this semi-sequentialaccess pattern is much more efficientthan reading some randomly chosenLBNs spread across the set of adjacenttracks of a single quadrangle.

Semi-sequential quadrangle access is used for retrieving minipages with all attributes comprising a full record. If the number of minipages/attributes does not fit into a single quadrangle, the remaining minipages are mapped to a quadrangle on a different disk. Using this mapping method, several disks can be accessed in parallel to retrieve full records efficiently.
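The address arithmetic behind sequential versus semi-sequential access can be sketched roughly as follows; the track length, quadrangle depth, and quadrangle-local LBN numbering here are toy simplifications, not Atropos's actual volume mapping:

```python
# Toy quadrangle addressing: each quadrangle is TRACK_LEN blocks wide and
# DEPTH tracks deep, and successive quadrangles rotate across disks like RAID
# stripe units. Column-major (sequential) access walks along one track;
# row-major (semi-sequential) access walks diagonally, touching one LBN per
# track so the disk scheduler can serve whichever is cheapest first.
TRACK_LEN = 4     # blocks per track within a quadrangle (toy value)
DEPTH = 3         # tracks per quadrangle (toy value)
NUM_DISKS = 4

def quadrangle_of(lbn):
    q = lbn // (TRACK_LEN * DEPTH)
    return q, q % NUM_DISKS            # quadrangle id and the disk it sits on

def sequential_run(lbn, count):
    """Column-major access: consecutive LBNs along a single track."""
    return [lbn + i for i in range(count)]

def semi_sequential_run(lbn):
    """Row-major access: one LBN per track, laid out diagonally."""
    return [lbn + t * (TRACK_LEN + 1) for t in range(DEPTH)]

print(quadrangle_of(30))              # which quadrangle/disk holds LBN 30
print(sequential_run(0, TRACK_LEN))   # [0, 1, 2, 3]
print(semi_sequential_run(0))         # [0, 5, 10] with these toy parameters
```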

SUMMARY

Fates is the first database system that leverages the unique characteristics of each level in the memory hierarchy. The decoupling of data layouts at the cache/main-memory and on-line storage levels is possible thanks to carefully orchestrated interactions between each Fate. Properly designed abstractions that hide specifics, yet allow the other components to take advantage of their unique strengths, achieve efficient query execution at all levels of the memory hierarchy.

REFERENCES

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis: Weaving Relations for Cache Performance. Proc. of VLDB, 169-180. Morgan Kaufmann, 2001.

[2] J. Schindler, J. L. Griffin, C. R. Lumb, G. R. Ganger: Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics. Conf. on File and Storage Technologies, 259-274. Usenix Association, 2002.

Figure 3: Mapping of a database table with 12 attributes onto an Atropos logical volume with quadrangles. Each quadrangle holds 4 attributes with 100 records each. The dashed arrow line shows efficient semi-sequential access for retrieving complete records.


AWARDS & OTHER PDL NEWS

September 2003
Ganger & the PDL Awarded Equipment Grants from IBM and Intel

IBM Corporation and Intel Corporation have each generously donated over $80K in equipment to provide an early testbed for PDL's new Self-* Storage project. The article on page 1 describes Self-* Storage.

September 2003
NSF Grant to Fund Self-* Storage Research

PDL researchers have received a $1.5 million NSF grant to pursue the Self-* Storage project, which seeks to create large-scale self-managing, self-organizing, self-tuning storage systems from generic servers. The project PI is Greg Ganger (ECE and CS; Director of PDL), and the co-PIs are Natassa Ailamaki (CS), Anthony Brockwell (Statistics), Garth Gibson (CS), and Mike Reiter (ECE and CS).

September 2003
Congratulations Natassa and Babak!

Natassa Ailamaki and Babak Falsafi are thrilled to announce the arrival of their daughter Niki Falsafi, who was born at Magee Women's Hospital at 7:44 a.m. on September 27.

September 2003
5 CMU Professors Receive NSF Grant to Study Drinking Water Quality and Security*

Faculty members Jeanne VanBriesen, Civil and Environmental Engineering (CEE), Christos Faloutsos, Computer Science, Anastassia Ailamaki, Computer Science, Mitch Small, CEE and Engineering and Public Policy, and Paul Fischbeck, Social and Decision Sciences, have received a National Science Foundation grant of $1.5 million for a new project called "SENSORS: Placement and Operation of an Environmental Sensor Network to Facilitate Decision Making Regarding Drinking Water Quality and Security."

*From CMU newspaper The Tartan, Sept. 8, 2003.

August 2003
Ted and Addie Marry!

Ted Wong and Addie Tyler were married on Aug 9, 2003 in the Sage Chapel at Cornell University in Ithaca, New York. Ted will be defending his dissertation on October 24, and then he and Addie are moving to the Palo Alto area in California where Ted will be working at IBM.

June 2003
Congratulations to Jiri and Katrina!

Jiri Schindler and Katrina Van Dellen were married at The Wiley Inn in Peru, Vermont on June 7, 2003. Jiri successfully defended his PhD dissertation on August 22 and will be moving to the Boston area soon to work at EMC.

July 2003

Jiri Schindler Receives Best Paper Award at VLDB Workshop

Jiri Schindler's paper "Matching Database Access Patterns to Storage Characteristics," co-authored with Anastassia Ailamaki and Greg Ganger, has received the award for Best Paper in the VLDB 2003 PhD Workshop from among 34 submissions. Jiri will present his paper at the VLDB PhD Workshop, co-located with VLDB 2003 (29th Conference on Very Large Databases) in Berlin in September. The VLDB 2003 PhD Workshop brings together PhD students working on topics related to the VLDB Conference series, to present and discuss their research in a constructive and international atmosphere. This paper is available on our publications page.

May 2003
John Linwood Griffin Receives Intel Fellowship

Our congratulations to John Linwood Griffin, who has been selected as a recipient of a 2003-04 Intel Foundation PhD Fellowship Award. The fellowship will cover John's full tuition, fees, and stipend for the year. Additionally, the fellowship provides John with an Intel-based laptop and a mentor who will act as a link between the student and those people pursuing relevant research at Intel. The fellowship does not involve an internship; rather, it is targeted at Ph.D. candidates within 18 months of degree completion. Approximately 35 candidates are selected annually for the award from a very competitive field.

May 2003
Brandon Salmon Awarded NSF Graduate Research Fellowship*

The winners of this year's National Science Foundation (NSF) Graduate Research Fellowships include Electrical and Computer Engineering (ECE) students Jennifer Morris and Brandon Salmon. The NSF's Graduate Research Fellowship funds three years of graduate study, including a $27,500 stipend for the first 12 months and an annual tuition allowance of $10,500, paid to the university. This year's contest was the most competitive in recent history: 7,788 applicants vied for 900 fellowships.

*CMU 8 1/2 x 11 News, May 1, 2003.

May 2003
James Hendricks Awarded National Defense Fellowship

Congratulations to James Hendricks, who has been awarded a National Defense Science and Engineering Graduate (NDSEG) Fellowship. The prevailing goal of this highly competitive program is "to provide the United States with talented, doctorally trained American men and women who will lead state-of-the-art research projects in disciplines having the greatest payoff to national security requirements." Since the program's inception 14 years ago, approximately 1,800 fellowships have been awarded from about 28,500 applications received. James' fellowship is supported by the Air Force Office of Scientific Research (AFOSR) and covers his full tuition and required fees during that term. Fellows have no military or other service obligations, and must be working towards a PhD.

February 2003
Welcome Michelle!

Michelle Liu was born to Mengzhi Wang and Honliang Liu on February 2, 2003 at 7 lbs. 9 oz. and 17.2 inches. It looks like she is already following her Mom's footsteps into computer-related research.

February 2003
Congrats to Chris & Alexis Long!

Robert Nicholas Long joined his parents Chris and Alexis on Feb. 16, 2003! What a bright and happy looking little fellow.

January 2003
Chris Long and Greg Ganger Receive Funding from C3S

Chris Long and Greg Ganger have been awarded seed funding from the Center for Computer Security (C3S) at Carnegie Mellon for their project "Access Control for the Masses." The project will fall within a new PDL research area dealing with Better User Interfaces.

October 2002
Welcome to the Newest Seshan!

Srini Seshan and his wife Asha welcomed their first baby on October 1, 2002. Sanjay Seshan was 7 lbs. 2 oz. and 20 3/4 inches long at birth. In the photo at 11 months of age, it looks like he has grown quite a bit since then!

Bruce Worthington of Microsoft discussing research with Steve Schlosser, Jiri Schindler, Brandon Salmon & Craig Soules.


SELF-* STORAGE

their performance and reliability status, and exchange information with human administrators. Dataset assignments and redundancy schemes are dynamically adjusted based on observed and projected performance and reliability. We refer to self-* collections of storage bricks as storage constellations.

Administration and Organization. At the top level, a self-* storage system will still require human administrators to provide guidance, approve procurement requests, physically install and repair equipment, and provide high-level goals for the system. A self-* storage system will need an administrative interface to provide information and offer solutions to the human administrator when problems arise or trade-offs (e.g., between performance and reliability) are faced. A self-* administrative interface should also help administrators decide when to acquire new components, which would then be automatically integrated into the self-* constellation.

The supervisors, processes playing an organizational role in the infrastructure, form a management hierarchy. They dynamically tune dataset-to-worker assignments, redundancy schemes for given datasets, and router policies. The hierarchy of supervisor nodes controls data partitioning and request distribution among workers, with the objective of partitioning data and goals among its subordinates (workers or lower-level supervisors) such that, if its children meet their assigned goals, the goals for the entire subtree will be met. By communicating goals down the tree, a supervisor gives its subordinates the ability to assess their own performance relative to goals as they internally tune; the supervisors need not concern themselves with details of how subordinates meet their goals. The top of the hierarchy interacts with the system administrator, receiving high-level goals for datasets and providing status and procurement requests. Additional internal services, referred to as administrative assistants, are also needed for a self-* constellation to function. Examples include event logging for problem diagnosis, directory services to help translate component/service names to their locations in the network, and security services.
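A minimal sketch of this complaint-driven delegation follows; worker names, capacities, and the rebalancing policy are all hypothetical, meant only to show a supervisor splitting a goal among subordinates and shifting load when one of them cannot meet its share:

```python
# Hypothetical sketch: a supervisor divides a dataset-level throughput goal
# among workers, then rebalances toward siblings with headroom when a worker
# "complains" that it cannot meet its assigned goal.
class Worker:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.goal = name, capacity, 0

    def can_meet(self):
        return self.goal <= self.capacity

class Supervisor:
    def __init__(self, workers):
        self.workers = workers

    def delegate(self, total_goal):
        share = total_goal / len(self.workers)
        for w in self.workers:
            w.goal = share
        self.handle_complaints()

    def handle_complaints(self):
        for w in self.workers:
            while not w.can_meet():
                # Move load to the sibling with the most headroom.
                best = max(self.workers, key=lambda s: s.capacity - s.goal)
                if best is w or best.capacity - best.goal <= 0:
                    print(f"{w.name}: goal cannot be met; escalate upward")
                    break
                shift = min(w.goal - w.capacity, best.capacity - best.goal)
                w.goal -= shift
                best.goal += shift

sup = Supervisor([Worker("brick-a", 120), Worker("brick-b", 80),
                  Worker("brick-c", 200)])
sup.delegate(total_goal=330)       # e.g., 330 MB/s across the whole dataset
for w in sup.workers:
    print(w.name, w.goal)
```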

Data Access and Storage. Workers store data and routers ensure that I/O requests are delivered to the appropriate workers for service. Thus, self-* clients interact with a self-* router to access data. We envision two types of self-* clients. Trusted clients are considered a part of the system, and may be physically co-located with other components (e.g., router instances); examples are file servers or databases that use the self-* constellation as back-end storage. Untrusted clients are modified to support the self-* constellation's internal protocols; they interact directly with self-* workers, via self-* routers, with access privileges verified on each request.

An important part of a self-* router's job is correctly handling accesses to data stored redundantly across storage nodes. Doing so requires a protocol to maintain data consistency and liveness in the presence of failures and concurrency. Since they will have flexibility in deciding which servers should handle certain requests, self-* routers will also have a role in dynamic load balancing as they deliver client requests to the appropriate workers, particularly when new data are created and when READ requests access redundant data. Doing so requires metadata for tracking current storage assignments, consistency protocols for accessing redundant data, and choices for routing requests.
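As a rough illustration of the routing role (the placement table, load numbers, and policy below are invented), a router's read path might simply pick the least loaded live replica for each dataset:

```python
# Hypothetical sketch of read routing over redundant copies: the router tracks
# which workers hold each dataset and sends a READ to the least loaded live
# replica, skipping workers marked as failed.
placement = {                       # dataset -> workers holding a replica
    "accounts": ["brick-a", "brick-c"],
    "logs":     ["brick-b", "brick-c"],
}
load = {"brick-a": 0.7, "brick-b": 0.2, "brick-c": 0.4}
failed = {"brick-a"}

def route_read(dataset):
    candidates = [w for w in placement[dataset] if w not in failed]
    if not candidates:
        raise RuntimeError("no live replica for " + dataset)
    return min(candidates, key=lambda w: load[w])

print(route_read("accounts"))   # -> brick-c (brick-a is down)
print(route_read("logs"))       # -> brick-b (lowest load)
```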

Self-* workers will service requests for and store assigned data. We expect them to have the computation and memory resources needed to internally adapt to their observed workloads by, for example, reorganizing on-disk placements and specializing cache policies. Workers will also handle storage allocation internally, both to decouple external naming from internal placements and to allow support for internal versioning, since they will keep historical versions (e.g., snapshots) of all data to assist with recovery from dataset corruption. Although the self-* storage architecture would work with workers as block stores (like SCSI or IDE/ATA disks), we believe they will work better with a higher-level abstraction (e.g., objects or files), which will provide more information for adaptive specializations. Self-* workers must also provide support for a variety of maintenance functions, including crash recovery, integrity checking, and data migration.

Continued from page 1

Continued on page 20

Figure 2: “Head-end” servers bridge external clients into the self-* storage constellation.


PROPOSALS & DEFENSES

PH.D. DISSERTATION
Matching Application Access Patterns to Storage Device Characteristics

Jiri Schindler, ECE
August 22, 2003

Thesis statement: "With sufficient information, a storage manager can exploit unique storage device characteristics to achieve better, more robust I/O performance. This information can be abstracted from device specifics, device-independent, and yet expressive enough to allow a storage manager to tune its access patterns to a given device."

This dissertation contends that storage device resources are not utilized to their full potential because too much is hidden behind their high-level storage interfaces. Current storage interfaces do not convey sufficient information to the storage manager to enable it to make informed decisions leading to the most efficient use of the storage device. To bridge the information gap between hosts and storage devices, the storage device should explicitly state its performance characteristics. Using this static information, a storage manager can take advantage of the device's unique strengths and avoid inefficient access patterns.

MS THESIS
A Framework for Implementing Background Storage Applications using Freeblock Scheduling

Eno Thereska, ECE
August 2003

There are many disk maintenance tasks that are required for robust system operation but have loose time constraints. Such "background" tasks need to complete within a reasonable amount of time, but are generally intended to occur during otherwise idle time. Examples include cache write-back, defragmentation, backup, integrity checking, virus scanning, report generation, tamper detection, and index generation. Developers of such applications have had no clean way of designing them, largely because applications traditionally have not trusted storage devices to do what is best for the application, a distrust reflected in the narrow interfaces between the two.

We introduce a framework for implementing background storage applications by adding a new asynchronous interface to the storage device. Applications register background tasks through the interface and the storage device notifies them of their completion. The storage device uses freeblock scheduling together with idle time detectors to guarantee that background applications make good progress independent of the system load and without impacting the foreground workload. This framework is described and evaluated in the context of two real applications, a snapshot-based backup and a cache cleaner.
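A minimal sketch of what such an asynchronous interface might look like to an application follows; the function names and callback shape are invented for illustration and are not the thesis's actual interface.

    # Illustrative sketch of an asynchronous background-task interface. The names
    # and callback shape are hypothetical, not the interface from the thesis.

    class BackgroundScheduler:
        def __init__(self):
            self.tasks = []

        def register_background_read(self, blocks, on_complete):
            # The application states WHAT it needs; the device decides WHEN,
            # using freeblock opportunities and detected idle time.
            self.tasks.append((set(blocks), on_complete))

        def freeblock_opportunity(self, block):
            # Called when a block can be read "for free", e.g., as the head
            # passes over it on the way to a foreground request.
            for blocks, on_complete in list(self.tasks):
                if block in blocks:
                    blocks.discard(block)
                    if not blocks:
                        self.tasks.remove((blocks, on_complete))
                        on_complete()

    sched = BackgroundScheduler()
    sched.register_background_read({7, 12}, lambda: print("backup segment done"))
    sched.freeblock_opportunity(7)
    sched.freeblock_opportunity(12)   # task finished; the callback fires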

MS THESIS
Storage-based Intrusion Detection: Watching Storage Activity For Suspicious Behavior

Adam Pennington, ECE
August 2003

Please see the abstract of the paper of the same name on pg. 7 for an outline of this thesis.

MS THESIS
Opportunistic Use of Content Addressable Storage for Distributed File Systems

Niraj Tolia, ECE
May 2003

Please see the abstract of the paper of the same name on pg. 17 for an outline of this thesis.

THESIS PROPOSAL
Efficient, Flexible Consistency for Highly Fault Tolerant Storage
Garth Goodson, ECE
August 18, 2003

Fault-tolerant storage systems spread data redundantly across a set of storage-nodes in an effort to preserve and provide access to data despite failures. One difficulty created by this architecture is the need for a consistent view, across storage-nodes, of the most recent update. Such consistency is made difficult by concurrent updates, partial updates made by clients that fail, and failures of storage-nodes. This thesis will demonstrate how to achieve scalable, highly fault-tolerant storage systems by leveraging an efficient and flexible family of strong consistency protocols enabled by server versioning. In particular, the design of block-based storage systems and file systems will be evaluated. The storage protocol is made space-efficient through the use of erasure codes and made scalable by offloading work from the storage-nodes to the clients. The protocol family is flexible in that it covers a broad range of system model assumptions with no changes to the client-server interface, server implementations, or system structure. Each protocol scales with its requirements: it only does work necessitated by the system and fault models.
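As a very rough illustration of why server-side versioning helps, the sketch below gathers timestamped versions from storage-nodes and returns the newest value present on enough of them to be considered complete; the real protocol family handles erasure coding, Byzantine faults, and repair, none of which appear here, and every name is invented.

    # Highly simplified sketch: read the newest "complete" version from a set of
    # versioning storage-nodes. Thresholds, names, and the notion of completeness
    # are placeholders; the actual protocol family is far more involved.

    class StorageNode:
        def __init__(self, history):
            self._history = history            # list of (timestamp, value) pairs
        def versions(self):
            return self._history

    def read_latest(nodes, complete_threshold):
        seen = {}                              # timestamp -> (count, value)
        for node in nodes:
            for ts, value in node.versions():
                count, _ = seen.get(ts, (0, value))
                seen[ts] = (count + 1, value)
        for ts in sorted(seen, reverse=True):  # the newest version stored on
            count, value = seen[ts]            # enough nodes is considered complete
            if count >= complete_threshold:
                return ts, value
        raise RuntimeError("no complete version found")

    nodes = [StorageNode([(1, "a"), (2, "b")]),
             StorageNode([(1, "a"), (2, "b")]),
             StorageNode([(1, "a")])]                # missed the latest write
    print(read_latest(nodes, complete_threshold=2))  # -> (2, 'b')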

THESIS PROPOSAL
Staged Database Systems

Stavros Harizopoulos, SCS
April 24, 2003

Thesis Statement: "By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby increasing throughput. Furthermore, staged database systems are more scalable, easier to extend,

Continued on page 18


INFERRING ATTRIBUTES FROM CONTEXT

As storage capacity continues to increase, users find it increasingly difficult to manage their files using traditional directory hierarchies. Attribute-based naming enables powerful search and organization tools for ever-increasing user data sets. However, such tools are only useful in combination with accurate attribute assignment. Existing systems rely on user input and content analysis, but they have enjoyed minimal success. We propose several new approaches to automatically assigning attributes to files through context analysis, a technique that has been successful in the Google web search engine. With extensions like application hints (e.g., web links for downloaded files) and inter-file relationships, it should be possible to infer useful attributes for many files, making attribute-based search tools more effective.

Existing Organizational Tools

As storage capacity increases, the amount of data belonging to an individual user increases accordingly. Soon, storage capacity will reach a point where there will be no reason for a user to ever delete old content; in fact, the time required to do so would be wasted. The challenge has shifted from deciding what to keep to finding particular information when it is desired. To meet this challenge, we need to improve our approach to personal data organization.

Today, most systems provide a tree-like directory hierarchy to organize files. Although this is easy for most users to understand, it does not provide the flexibility required to scale to large numbers of files. In particular, the strict hierarchy provides only a single categorization with no cross-referenced information.

Alternatives to the standard directory hierarchy generally assign attributes to files, providing the ability to cluster and search for files by their attributes. An attribute can be any metadata that describes the file, although most systems use keywords or <category, value> pairs. The key challenge is assigning useful, meaningful attributes to files.

Unfortunately, the two most prevalent methods of attribute assignment, user input and content analysis, have been largely unsuccessful. Although users often have a good understanding of the files they create, it can be time-consuming and unpleasant to distill that information into the right set of keywords. As a result, users are understandably reluctant to do so. On the other hand, content analysis takes none of the user's time, and can be performed entirely in the background to eliminate any potential performance penalty. However, the complexity of language parsing, combined with the large number of proprietary file formats and non-textual data types, restricts the effectiveness of content analysis.

Context-based Attributes

Early web search engines (e.g., Lycos) relied upon user input (user-submitted web pages) and content analysis (word counts, word proximity, etc.). Although valuable, the success of these systems has been eclipsed by the success of Google.

To provide better search results, Google utilizes two forms of context analysis. First, it uses the text associated with a link to determine attributes for the linked site. This text gives the context of both the creator of the linking site and the user who clicks on the link at that site. The more times that a particular word links to a site, the higher that word is ranked for that site. Second, Google uses the actions of a user after a search to decide what the user wanted from that search. For example, if a user clicks on the first four links of a given search, and then does not return, it is likely that the fourth link was the best match, providing the user's context for those search terms.

Unfortunately, Google's approach to indexing does not translate directly into the realm of file systems. Much of the information that Google relies on does not exist within a file system. Also, Google's query feedback mechanism relies on two properties: users are normally looking for the most popular sites when they perform a query, and there is a large user base that will repeat the same query many times. Conversely, in file systems, users usually search for files that have not been accessed in a long time, because they usually remember where recently accessed files reside, and there is generally only a single user for each set of files, making it unlikely that frequent queries will be generated for any given file.

Context-based Attributes in File Systems

We are investigating four approaches to automatically gathering context information for use in file systems. The first two focus on gathering attributes when a file is created or accessed. The second two focus on propagating attributes among related files to increase the coverage of attribute assignment. Together, these techniques should categorize a much broader set of files than creation-based attribute assignment alone.

Application assistance: Although computers provide a vast array of functionality, most people use their computer for a limited set of tasks using a small set of applications that, in turn, access and create most of the user's files. Modifying these applications to provide hints about the user's context could provide invaluable attribute information.

Existing user input: Although most users are not willing to input additional information, they are willing to choose a directory and name for their files. Each of the sub-directories along the path and the file name itself probably contain context information that can be used to assign attributes. For example, if the user stores a file in "/home/papers/FS/Attribute-based/Semantic91.ps," then it

Continued on page 15

Craig Soules & Greg Ganger


INFERRING ATTRIBUTES FROM CONTEXT

is likely that they believe the file is a "paper" having to do with "FS," "attribute-based," and "semantic."
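A minimal sketch of harvesting attributes from a path a user has already chosen might look like the following; the attribute categories and filtering are arbitrary choices for the example.

    # Illustrative sketch: derive attributes from the directory path and file name
    # the user already chose. The category names are made up for the example.

    import os

    def attributes_from_path(path):
        directory, filename = os.path.split(path)
        base, extension = os.path.splitext(filename)
        attrs = {("filetype", extension.lstrip("."))} if extension else set()
        for component in directory.strip("/").split("/"):
            if component and component != "home":
                attrs.add(("keyword", component.lower()))
        attrs.add(("keyword", base.lower()))
        return attrs

    print(attributes_from_path("/home/papers/FS/Attribute-based/Semantic91.ps"))
    # e.g. ("filetype", "ps"), ("keyword", "papers"), ("keyword", "fs"),
    #      ("keyword", "attribute-based"), ("keyword", "semantic91")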

User access patterns: As users access their files, the pattern of their accesses provides a set of temporal relationships between files. A possible use of this information is to help propagate information between related files. For example, accessing "SemanticFS.ps" and "Gopal.ps" followed by updating "related.tex" may indicate a relationship between the three files. Subsequently, accessing "related.tex" and creating "FindingFiles.ps" may indicate a transitive relationship.

Inter-file content analysis: Content analysis will continue to be an important part of automatically assigning attributes. In addition to existing per-file analysis techniques, our focus on creating context-based connections between files suggests another source of attributes: content-based relationships. For example, some current file systems use hashing to eliminate duplicate blocks within a file system, or even locate similarities on non-block-aligned boundaries. Such content overlap could also be used to identify related files, by treating files with large matching data sets as related. Similarly, users (or the system) will often keep several slightly different versions of a file. Although these files generally contain differences, often the inherent information contained within does not change (e.g., a user may keep three instances of their resume, each focused for a different type of job application). This gives the system two opportunities for content analysis. First, content comparison can identify related files. Second, by performing content analysis solely on the differences between versions, it may be possible to determine version-specific attributes, making it easier for users to locate individual version instances.
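As an illustration of the content-overlap idea, the sketch below hashes fixed-size blocks and treats files sharing many block hashes as related; the block size, threshold, and sample data are arbitrary.

    # Illustrative sketch: flag files as related when many of their content-block
    # hashes match. Block size and the similarity threshold are arbitrary here.

    import hashlib

    def block_hashes(data, block_size=4096):
        return {hashlib.sha1(data[i:i + block_size]).hexdigest()
                for i in range(0, len(data), block_size)}

    def related(data_a, data_b, threshold=0.5):
        a, b = block_hashes(data_a), block_hashes(data_b)
        if not a or not b:
            return False
        return len(a & b) / min(len(a), len(b)) >= threshold

    resume_v1 = b"Objective: storage systems research. " * 200
    resume_v2 = resume_v1 + b"Now also: file search tools."
    print(related(resume_v1, resume_v2))   # True: the versions share most blocks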

Prototype Evaluation System

Figure 1 shows an overview of a prototype system for evaluating context-based attribute assignment schemes. The system is composed of four main parts: the tracer, the application interface, the analyzer, and the database. The tracer keeps a trace of all file system activity in the system. Any file system calls made by applications are tracked and stored in a file for later offline analysis. This allows a single system to employ a variety of different analysis techniques. The application interface allows applications to pass context information into the system, such as email header information or link information from a web browser. This information is used by the analyzer to generate attributes for files. The analyzer combines application information and offline trace analysis to generate attributes for files. All updated attribute information is passed to the database, which provides the search interface to the application. It allows applications to locate files using the file attributes assigned by the analyzer. Feedback from the search results is pushed to the analyzer for further attribute refinement.

This design could include multiple databases. In order to compare the results of different trace analysis algorithms, the analyzer could maintain a database for each, and users could compare the results of the different approaches. For more information on this project see Soules [1] or the PDL project page at www.pdl.cmu.edu/AttributeNaming/.

References
[1] Craig A.N. Soules, Greg Ganger. Why Can't I Find My Files? New methods for automating attribute assignment. Proceedings of the Ninth HotOS Workshop, USENIX Association, May 2003.

Figure 1: A prototype system for evaluating context-based attribute assignment schemes.

Continued from page 14


Young Professor Tim Ganger teaching the ECE 18-746 storage systems class.


RECENT PUBLICATIONS

tively limited resources (CPU, memory and/or communication bandwidth and power) poses some interesting challenges. We need both powerful and concise "languages" to represent the important features of the data, which can (a) adapt and handle arbitrary periodic components, including bursts, and (b) require little memory and a single pass over the data.

This allows sensors to automatically (a) discover interesting patterns and trends in the data, and (b) perform outlier detection to alert users. We need a way for a sensor to discover something like "the hourly phone call volume so far follows a daily and a weekly periodicity, with bursts roughly every year," which a human might recognize as, e.g., the Mother's Day surge. When possible and if desired, the user can then issue explicit queries to further investigate the reported patterns.

In this work we propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors operating in remote or hostile environments to discover patterns efficiently and effectively, with practically no user intervention. Our algorithms require limited resources and thus can be incorporated in individual sensors, possibly alongside a distributed query processing engine. Updates are performed in constant time, using sublinear (in fact, logarithmic) space. Existing, state-of-the-art forecasting methods (AR, SARIMA, GARCH, etc.) fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first method that has all the above characteristics.
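Roughly speaking, the trick is a multi-resolution summary maintained in a single pass: the sketch below keeps one pending average per wavelet level plus a few recent detail coefficients per level, so space grows only logarithmically with the stream length, and a small per-level forecasting model could then be fit on such coefficients. This is an illustrative sketch of that general idea, not the AWSOM implementation.

    # Illustrative sketch: single-pass, logarithmic-space incremental Haar-style
    # decomposition of a stream, the kind of front end a per-level stream model
    # could be fit on. Names and the window size are illustrative only.

    from collections import deque
    import math

    class StreamingHaar:
        def __init__(self, window=4):
            self.pending = {}              # level -> one average awaiting its sibling
            self.recent = {}               # level -> a few recent detail coefficients
            self.window = window

        def update(self, x):
            level, value = 0, float(x)
            while level in self.pending:   # carry completed pairs upward
                left = self.pending.pop(level)
                detail = (left - value) / math.sqrt(2.0)
                self.recent.setdefault(level, deque(maxlen=self.window)).append(detail)
                value = (left + value) / math.sqrt(2.0)
                level += 1
            self.pending[level] = value    # at most one pending value per level

    model = StreamingHaar()
    for reading in [3, 5, 4, 8, 7, 9, 6, 10]:
        model.update(reading)              # constant work per arriving value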

Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform automatically and efficiently. In fact, AWSOM outperforms manually set up auto-regressive models, both in terms of long-term pattern detection and modeling, as well as by at least 10x in resource consumption.

Location-based Node IDs: Enabling Explicit Locality in DHTs

Zhou, Ganger & Steenkiste
Carnegie Mellon University School of Computer Science Technical Report CMU-CS-03-171, August 2003.

Current peer-to-peer systems based on DHTs struggle with routing locality and content locality because of random node ID assignment. To address these issues, we promote the use of location-based node IDs to encode physical topology and improve routing. This gives applications explicit knowledge about and control over data locality at a coarse grain. Applications can place content in particular regions or route towards a close replica. Schemes to address the difficulties that ensue, particularly load imbalance, are discussed.
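One way to picture location-based IDs is to prefix each node's identifier with bits derived from its region, so nearby nodes share a prefix and content can be pinned to a region; the encoding below is invented purely for illustration.

    # Illustrative sketch: node IDs whose high-order bits encode coarse location,
    # so IDs sharing a prefix are physically close. The region table and the bit
    # split are made up for the example.

    import hashlib

    REGION_BITS = {"pittsburgh": 0b00, "seattle": 0b01, "boston": 0b10}

    def node_id(region, hostname, suffix_bits=14):
        # High bits: coarse location. Low bits: a hash of the host for uniqueness.
        suffix = int(hashlib.sha1(hostname.encode()).hexdigest(), 16) % (1 << suffix_bits)
        return (REGION_BITS[region] << suffix_bits) | suffix

    def same_region(id_a, id_b, suffix_bits=14):
        return (id_a >> suffix_bits) == (id_b >> suffix_bits)

    a = node_id("pittsburgh", "node1.example.edu")
    b = node_id("pittsburgh", "node2.example.edu")
    print(same_region(a, b))   # True: content on either node stays in-region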

GEM: Graph EMbedding for Routing and Data-Centric Storage in Sensor Networks Without Geographic Information

Newsome & Song

Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys 2003). November 5-7, 2003, Redwood, CA.

The widespread deployment of sensor networks is on the horizon. One of the main challenges in sensor networks is to process and aggregate data in the network rather than wasting energy by sending large amounts of raw data to reply to a query. Some efficient data dissemination methods, particularly data-centric storage and information aggregation, rely on efficient routing from one node to another. In this paper we introduce GEM (Graph EMbedding for sensor networks), an infrastructure for node-to-node routing and data-centric storage and information processing in sensor networks. Unlike previous approaches, it does not depend on geographic information, and it works well even in the face of physical obstacles. In GEM, we construct a labeled graph that can be embedded in the original network topology in an efficient and distributed fashion. In that graph, each node is given a label that encodes its position in the original network topology. This allows messages to be efficiently routed through the network, while each node only needs to know the labels of its neighbors. To demonstrate how GEM can be applied, we have developed a concrete graph em-

Continued from page 7

Continued on page 17

Figure: Virtual polar coordinate routing for VPCS. A packet is routed from S to D using the three routing algorithms: Naive Tree (4 hops), Smart Tree (3 hops), and VPCR (2 hops). Smart Tree and VPCR use a 1-hop neighborhood.


RECENT PUBLICATIONS

bedding method, VPCS (Virtual Polar Coordinate Space). In VPCS, we embed a ringed tree into the network topology, and label the nodes in such a manner as to create a virtual polar coordinate space. We have also developed VPCR, an efficient routing algorithm that uses VPCS. VPCR is the first algorithm for node-to-node routing that guarantees reachability, requires each node to keep state only about its immediate neighbors, and requires no geographic information. Our simulation results show that VPCR is robust on dynamic networks, works well in the face of voids and obstacles, and scales well with network size and density.
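To give a flavor of routing on tree labels without geographic coordinates, the sketch below forwards a message toward the common ancestor and back down using only interval labels like those in the figure; it shows the simple tree-routing baseline, not VPCR's shortcuts, and all names are illustrative.

    # Illustrative sketch of the naive tree-routing baseline from the figure, where
    # each node owns an interval of a virtual address space. VPCR improves on this
    # with virtual polar coordinates, which are not shown here.

    class TreeNode:
        def __init__(self, lo, hi, parent=None):
            self.lo, self.hi = lo, hi
            self.parent = parent
            self.children = []

        def add_child(self, lo, hi):
            child = TreeNode(lo, hi, self)
            self.children.append(child)
            return child

    def next_hop(node, dest):
        # Only local knowledge is needed: my label, my parent, my children's labels.
        if not (node.lo <= dest <= node.hi):
            return node.parent                 # destination lies outside my subtree
        for child in node.children:
            if child.lo <= dest <= child.hi:
                return child                   # descend toward the destination
        return None                            # this node owns the destination label

    root = TreeNode(0, 100)
    left, right = root.add_child(0, 50), root.add_child(51, 100)
    s, d = left.add_child(0, 25), right.add_child(76, 100)
    print(next_hop(s, 90) is left)             # True: route s -> left -> root -> right -> d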

Opportunistic Use of Content Addressable Storage for Distributed File Systems

Tolia, Kozuch, Satyanarayanan, Karp, Bressoud & Perrig

Proceedings USENIX Annual Technical Conference, General Track 2003: 127-140, June 9-14, San Antonio, TX.

Motivated by the prospect of readily available Content Addressable Storage (CAS), we introduce the concept of file recipes. A file's recipe is a first-class file system object listing content hashes that describe the data blocks composing the file. File recipes provide applications with instructions for reconstructing the original file from available CAS data blocks. We describe one such application of recipes, the CASPER distributed file system. A CASPER client opportunistically fetches blocks from nearby CAS providers to improve its performance when the connection to a file server traverses a low-bandwidth path. We use measurements of our prototype to evaluate its performance under varying network conditions. Our results demonstrate significant improvements in execution times of applications that use a network file system. We conclude by describing fuzzy block matching, a promising technique for using approximately matching blocks on CAS providers to reconstitute the exact desired contents of a file at a client.
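The recipe idea amounts to shipping hashes instead of bytes: the sketch below builds a recipe as a list of block hashes and reassembles the file from whichever blocks a nearby provider can supply, falling back to the file server for the rest; the block size and lookup interfaces are invented for illustration, not CASPER's.

    # Illustrative sketch of file recipes: a list of content hashes describing a
    # file's blocks. Block size and the provider/server lookups are hypothetical.

    import hashlib

    BLOCK = 4096

    def make_recipe(data):
        return [hashlib.sha1(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)]

    def reconstruct(recipe, nearby_cas, file_server):
        blocks = []
        for digest in recipe:
            block = nearby_cas.get(digest)      # cheap: fetched over the LAN
            if block is None:
                block = file_server[digest]     # costly: the low-bandwidth WAN path
            blocks.append(block)
        return b"".join(blocks)

    data = b"x" * 10000
    recipe = make_recipe(data)
    cas = {recipe[0]: data[:BLOCK]}             # the provider happens to hold block 0
    server = {h: data[i * BLOCK:(i + 1) * BLOCK] for i, h in enumerate(recipe)}
    assert reconstruct(recipe, cas, server) == data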

Metadata Efficiency in a Comprehensive Versioning File System

Soules, Goodson, Strunk & Ganger

2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, Mar 31 - Apr 2, 2003.

A comprehensive versioning file system creates and retains a new file version for every WRITE or other modification request. The resulting history of file modifications provides a detailed view to tools and administrators seeking to investigate a suspect system state. Conventional versioning systems do not efficiently record the many prior versions that result. In particular, the versioned metadata they keep consumes almost as much space as the versioned data. This paper examines two space-efficient metadata structures for versioning file systems and describes their integration into the Comprehensive Versioning File System (CVFS). Journal-based metadata encodes each metadata version into a single journal entry; CVFS uses this structure for inodes and indirect blocks, reducing the associated space requirements by 80%. Multiversion b-trees extend the per-entry key with a timestamp and keep current and historical entries in a single tree; CVFS uses this structure for directories, reducing the associated space requirements by 99%. Experiments with CVFS verify that its current-version performance is similar to that of non-versioning file systems. Although access to historical versions is slower than in conventional versioning systems, checkpointing is shown to mitigate this effect.
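The multiversion idea can be pictured as ordinary directory entries whose keys carry a timestamp, so a lookup "as of" time t finds the newest entry not later than t; the sketch below uses a sorted list in place of a real b-tree and is illustrative only, not CVFS code.

    # Illustrative sketch of multiversion keys: directory entries keyed by
    # (name, timestamp) and looked up "as of" a given time. A sorted list stands
    # in for the on-disk b-tree.

    import bisect

    class MultiversionDirectory:
        def __init__(self):
            self.keys = []                   # sorted (name, timestamp) pairs
            self.values = []                 # inode for the corresponding key

        def put(self, name, timestamp, inode):
            i = bisect.bisect_right(self.keys, (name, timestamp))
            self.keys.insert(i, (name, timestamp))
            self.values.insert(i, inode)

        def lookup(self, name, as_of):
            # Newest entry for this name with timestamp <= as_of.
            i = bisect.bisect_right(self.keys, (name, as_of))
            if i and self.keys[i - 1][0] == name:
                return self.values[i - 1]
            return None

    d = MultiversionDirectory()
    d.put("report.txt", 10, "inode-7")
    d.put("report.txt", 42, "inode-9")           # a later version of the same name
    print(d.lookup("report.txt", as_of=30))      # -> inode-7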

Byzantine-tolerant Erasure-coded Storage

Goodson, Wylie, Ganger & Reiter

Carnegie Mellon University SCS Technical Report CMU-CS-03-187, September 2003.

This paper describes a decentralized consistency protocol for survivable storage that exploits data versioning within storage-nodes. Versioning enables the protocol to efficiently provide linearizability and wait-freedom of read and write operations to erasure-coded data in asynchronous environments with Byzantine failures of clients and servers. Exploiting versioning storage-nodes, the protocol shifts most work to clients. Reads occur in a single round-trip unless clients observe concurrency or write failures. Measurements of a storage system using this protocol show that

Continued from page 16

Continued on page 19

System diagram: a Coda client and its proxy satisfy a file read (1) by fetching the file's recipe from the recipe server (2, 3), requesting matching blocks from a CAS storage jukebox over the LAN (4, 5), and retrieving any missed blocks from the Coda file server over the WAN connection (6, 7).


PROPOSALS & DEFENSES

and more readily fine-tuned than traditional database systems."

Database system architectures face a rapidly evolving operating environment where millions of users store and access terabytes of data. To cope with increasing demands for performance, high-end DBMSs employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted work-centric, thread-parallel execution model entails several shortcomings that limit server performance, the most important being failure to exploit instruction and data commonality across concurrent requests. Moreover, the monolithic approach in DBMS software has led to complex designs which are difficult to extend.

This thesis introduces a staged design for high-performance, evolvable DBMSs that are easy to fine-tune and maintain. I propose to break the database system into modules and encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMSs by providing solutions at (a) the hardware level: it optimally exploits the underlying memory hierarchy, and (b) the software engineering level: it is more scalable, easier to extend, and more readily fine-tuned than traditional database systems.
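A toy rendering of the stages-and-queues structure appears below: each stage drains its input queue in a batch so requests needing the same code and data are processed together; the stage names and batch policy are illustrative, not the proposed system's.

    # Illustrative sketch of a staged design: self-contained stages connected by
    # queues, each drained in batches so concurrent requests share instruction and
    # data footprints. Stage names and the batching policy are made up here.

    from collections import deque

    class Stage:
        def __init__(self, name, work, downstream=None):
            self.name, self.work, self.downstream = name, work, downstream
            self.queue = deque()

        def enqueue(self, request):
            self.queue.append(request)

        def run_batch(self):
            # Process everything queued at once: one pass through this stage's
            # code and data structures for the whole batch of requests.
            while self.queue:
                result = self.work(self.queue.popleft())
                if self.downstream:
                    self.downstream.enqueue(result)

    execute = Stage("execute", lambda q: "results(" + q + ")")
    optimize = Stage("optimize", lambda q: "plan(" + q + ")", downstream=execute)
    parse = Stage("parse", lambda q: "ast(" + q + ")", downstream=optimize)

    for query in ("SELECT 1", "SELECT 2"):
        parse.enqueue(query)
    for stage in (parse, optimize, execute):
        stage.run_batch()                    # both requests move through each stage together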

THESIS PROPOSAL

Using MEMS-based Storage Devices in Computer Systems
Steve Schlosser, ECE
June 5, 2003

MEMS-based storage is an interesting new technology that promises to bring fast, non-volatile, mass data storage to computer systems. MEMS-based storage devices (MEMStores) themselves consist of several thousand read/write tips, analogous to the read/write heads of a disk drive, which read and write data in a recording medium. This medium is coated on a moving rectangular surface that is positioned by a set of MEMS actuators. Access times are expected to be less than a millisecond, with power consumption 10-100X less than a low-power disk drive, while streaming bandwidth and volumetric density are expected to be around those of disk drives.

We are starting to explore how MEMStores would best be used in computer systems and how those systems should adapt to their differences as compared to disks. For example, existing operating system policies are tuned for disks, including request scheduling, data layout, and power conservation. Also, given the performance, capacity, and non-volatility of MEMStores, they represent a new, intermediate member of the memory hierarchy. My thesis is that most of these aspects can conform, with little penalty, to disk-like policies and usages.

Because MEMStores perform basically like fast disks, with only a few exceptions, they can be treated by systems as such. My dissertation will show that for most workloads, the same linear logical block abstraction that is used for disk drives is appropriate for MEMStores. The benefit of using the same abstraction is that MEMStores can be easily integrated into computer systems with little or no change.

THESIS PROPOSAL

D-SPTF: Decentralized Scheduling for Storage Bricks

Christopher Lumb, ECE
August 19, 2003

Distributed Shortest-Positioning Time First (D-SPTF) is a request distribution protocol for decentralized systems of storage servers.

D-SPTF exploits high-speed interconnects to dynamically select which server, among those with a replica, should service each read request. In doing so, it simultaneously balances load, exploits the aggregate cache capacity, and reduces positioning times for cache misses. For network latencies of up to 0.5ms, D-SPTF performs as well as would a hypothetical centralized system with the same collection of CPU, cache, and disk resources. Compared to existing decentralized approaches, such as hash-based request distribution, D-SPTF achieves up to 50% higher throughput and adapts more cleanly to heterogeneous server capabilities.
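The selection rule can be pictured as each replica-holding server estimating its own cost for the request and the cheapest one claiming it: a cached block is served for free, otherwise the estimated positioning time is charged; the cost model below is a stand-in for illustration, not the D-SPTF scheduler itself.

    # Illustrative sketch of picking which replica-holding server handles a READ:
    # cache hits cost nothing, otherwise charge an estimated positioning time.
    # The cost model and names are stand-ins, not the actual D-SPTF scheduler.

    class Server:
        def __init__(self, name, cached_blocks, est_positioning_ms):
            self.name = name
            self.cached_blocks = set(cached_blocks)
            self.est_positioning_ms = est_positioning_ms   # from head and queue state

        def cost(self, block):
            return 0.0 if block in self.cached_blocks else self.est_positioning_ms

    def pick_server(replica_holders, block):
        return min(replica_holders, key=lambda s: s.cost(block))

    servers = [Server("brick-1", {10, 11}, 6.5),
               Server("brick-2", set(), 3.0),
               Server("brick-3", {42}, 8.0)]
    print(pick_server(servers, block=42).name)   # brick-3: a cache hit beats a seek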

THESIS PROPOSAL

Autonomous Spatio-Temporal Data Mining

Spiros Papadimitriou, SCS
May 5, 2003

The goal of data mining is to facilitate the extraction of useful information from large collections of data. Thus, eliminating the requirement of user intervention is essential. We propose to develop fast tools for spatio-temporal data mining towards that goal. We have completed work in the area of spatio-temporal data mining and, in particular, outlier detection and time series modeling. These provide sufficient evidence that we can improve upon previous techniques.

THESIS PROPOSAL
Prefetching and Locality Optimizations for Database Memory Hierarchy Performance

Shimin Chen, SCS
May 2, 2003

Database performance studies have traditionally focused on I/O performance. Recently, researchers have shown that, on traditional disk-oriented databases, roughly 50% or more of the execution time in memory is wasted due to cache misses. Therefore, to exploit the full power of modern computer systems requires optimizing both cache and disk performance in the memory hierarchy, which together

Continued from page 13

Continued on page 20


RECENT PUBLICATIONS

the protocol scales well with the number of failures tolerated, and that it outperforms a highly-tuned instance of Byzantine-tolerant state machine replication.

Robustness Hinting for Improving End-to-End Dependability

Bigrigg

Second Workshop on Evaluating and Architecting System Dependability (EASY). In conjunction with ASPLOS-X. Sunday, 6 October 2002, San Jose, California, U.S.A.

File systems make unreasonable attempts to provide data, to the point that they will block an application instead of passing the error on to the application to handle. Transient problems such as network congestion or outages and heavily loaded systems or denial of service attacks can lead to failure-like situations. Alternative mechanisms have been developed for the file system to trade performance for robustness in an attempt to always guarantee full availability of data. These mechanisms may not be necessary, as the application programmer may have already accounted for such situations. By hinting to the file system the application's ability to handle errors, it is possible for the file system to make better resource allocation decisions and improve end-to-end dependability.

Data Staging on Untrusted Surrogates

Flinn, Sinnamohideen, Tolia & Satyanarayanan

Proceedings 2nd USENIX Conference on File and Storage Technologies (FAST03), Mar 31 - Apr 2, 2003, San Francisco, CA.

We show how untrusted computers can be used to facilitate secure mobile data access. We discuss a novel architecture, data staging, that improves the performance of distributed file systems running on small, storage-

Continued from page 17

YEAR IN REVIEW

Continued from page 4

Interaction and Security Systems in Fort Lauderdale, FL.

March 2003
Christos Faloutsos presented a tutorial on "Data Mining the Internet" at INFOCOM, San Francisco, CA.

January 2003
Over the past term, several visitors have contributed to our Storage Systems course, including: Dave Anderson, Seagate; Steve Kleiman, NetApp; Ric Wheeler, EMC; Harald Skardal, NetApp; Jim Hughes, StorageTek; Richie Lary, Independent Consultant; and Mark Carlson, Sun.
Stavros Harizopoulos presented "A Case for Staged Database Systems" at the First International Conference on Innovative Data Systems Research (CIDR), in Asilomar, CA.

December 2002

Greg Ganger chaired the session on Decentralized Storage Systems at OSDI in Boston, MA.
Timmy Ganger gave a guest lecture in 15-712 (Advanced OS and Distributed Systems).
Ted Wong presented "Verifiable Secret Redistribution for Archive Systems" at the First International IEEE Security in Storage Workshop.

November 2002

SDI Speaker: Richard Golding, then of Panasas, Inc., spoke on "The Palladio Project: A Survivable, Scalable Distributed Storage System."
DB Seminar Speaker: C. Mohan of IBM on "Future Directions in Data Mining: Streams, Networks, Self-similarity and Power Laws."
Christos Faloutsos gave the keynote talk at CIKM in McLean, VA and was also an invited speaker at the NSF NGDM Workshop in Baltimore, MD and the N.A.S. Workshop in Washington, DC.

October 2002

SDI Speaker: John Wilkes of HP Labs on "Travelling to Rome: QoS Specifications for Automated Storage System Management."

limited pervasive computing devices. Data staging opportunistically prefetches files and caches them on nearby surrogate machines. Surrogates are untrusted and unmanaged: we use end-to-end encryption and secure hashes to provide privacy and authenticity of data, and have designed our system so that surrogates are as reliable and easy to manage as possible. Our results show that data staging reduces average file operation latency for interactive applications running on the Compaq iPAQ hand-held by up to 54%.

Give PDL storage systems researchers snow and a plastic chair and see what happens!


PROPOSALS & DEFENSES

Continued from page 18

have not been well studied for database systems before. My thesis is that prefetching and locality optimizations can effectively improve both cache and disk performance of database systems, and that differences between the cache-to-memory and the memory-to-disk gaps play a significant role in the design and choice of specific prefetching and locality optimization techniques.

To validate my thesis, I revisit two important classes of database algorithms, B+tree index algorithms and join algorithms. In my preliminary work, I have exploited cache prefetching to improve search and range scan operations of main memory B+trees. I have studied fractal prefetching B+trees, which are a new type of B+tree that optimizes both cache and disk performance. In fractal prefetching B+trees, smaller cache-optimized trees are embedded in disk pages to improve data locality for index search. Both cache prefetching and I/O prefetching are used to improve performance. I have worked on improving hash join cache performance by exploiting inter-tuple parallelism through prefetching. For my proposed future work, I will be focusing on utilizing history information to improve join performance. Since updates are relatively infrequent compared to joins in DSS and OLAP environments, it is possible to use history information about matching tuples to guide and improve future join performance. In contrast to previous studies with join indices, I want to analyze the history information to identify data locality in the joining relations and then improve join performance by exploiting the data locality and using prefetching.
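To illustrate the flavor of prefetching in index traversal, the sketch below requests the next few leaves of a range scan before they are needed, so misses overlap with useful work; the prefetch call is a placeholder (real implementations use cache-prefetch instructions or asynchronous I/O), and the structure is not fractal prefetching B+tree code.

    # Illustrative sketch of prefetching during a B+tree range scan. prefetch() is
    # a placeholder for a cache-prefetch intrinsic or an asynchronous disk read;
    # this is not the fractal prefetching B+tree implementation.

    PREFETCH_DEPTH = 3

    class Leaf:
        def __init__(self, keys, next_leaf=None):
            self.keys = sorted(keys)
            self.next_leaf = next_leaf

    def prefetch(node):
        pass   # stand-in: issue the memory or disk request without waiting on it

    def range_scan(first_leaf, low, high):
        results, leaf = [], first_leaf
        while leaf is not None:
            ahead = leaf
            for _ in range(PREFETCH_DEPTH):    # ask for upcoming leaves now, so the
                ahead = ahead.next_leaf        # data is arriving while we scan this one
                if ahead is None:
                    break
                prefetch(ahead)
            results.extend(k for k in leaf.keys if low <= k <= high)
            if leaf.keys and leaf.keys[-1] > high:
                break
            leaf = leaf.next_leaf
        return results

    l3 = Leaf([30, 35]); l2 = Leaf([20, 25], l3); l1 = Leaf([10, 15], l2)
    print(range_scan(l1, 12, 27))   # -> [15, 20, 25]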

THESIS PROPOSAL

Prototyping Without Prototyping: Evaluating hypothetical storage components in real systems

John Linwood Griffin, ECE
August 1, 2003

For my dissertation research I am continuing our investigation into the utility and capabilities of timing-accurate storage emulation (TASE).

Our previous work demonstrates that TASE is feasible for evaluating both evolutionary changes (faster platter speeds; modified firmware algorithms) and revolutionary changes (MEMS-based technology) to storage devices.

Several interesting questions remain before this new storage evaluation technique can be fully utilized by researchers and developers. For example, an emulator may have to measure and deal with variable externally-induced timing errors during an experiment. How can an evaluator rest assured that the emulator is correctly compensating for these errors? As another example, an emulator may need to keep data in RAM in order to provide per-request data before each request completes. How can an emulator meet such deadlines when the experimental working set is larger than the emulator's RAM?

Ursa Minor and Ursa Major

In order to effectively explore how our ideas will simplify storage administration, it is essential that operational systems be built and deployed. The Parallel Data Lab has been developing technologies relevant to the self-* storage architecture for several years, allowing us to build a prototype relatively quickly. Our initial focus will be on implementing the data protection aspects, embedding instrumentation, and enabling experimentation with performance and diagnosis.

Our first prototype, named Ursa Minor, will be approximately 15 TB spread over 45 small-scale storage bricks. The main goal of this first prototype is rapid (internal) deployment to learn lessons

SELF-* STORAGE

Continued from page 12

for the design of a second, larger-scale instantiation of self-* storage. Ursa Major will be a large-scale (~1 PB) storage constellation. Its data storage capacity will be available to research groups around Carnegie Mellon (e.g., groups involved in data mining and scientific visualization) who rely on large quantities of storage for their work. We are convinced that such deployment and maintenance is necessary to evaluate self-* storage's ability to simplify administration for system scales and workload mixes that traditionally present difficulties. As well, our use of low-cost hardware and immature software will push the boundaries of fault-tolerance and automated recovery mechanisms, which are critical for storage infrastructures. From the perspective of their users, our early self-* constellations will look like really big, really fast, really reliable file servers (initially NFS version 3). The decision to hide behind a standard file server interface was made to reduce software version complexities and user-visible changes: user machines can be unmodified, and all bug fixes and experiments can happen transparently behind a standard file server interface. The resulting architecture is illustrated in Figure 2. Direct, untrusted client access will be added over time.

For more information, please see the Self-* Storage project page at http://www.pdl.cmu.edu/SelfStar/.