“Creating Data Repositories..” Sanjay Rao ECE Dept, Purdue University


Page 1: “Creating Data Repositories..”

“Creating Data Repositories..”

Sanjay Rao

ECE Dept, Purdue University

Page 2: “Creating Data Repositories..”

Group Members

• Dave Maltz
• Rebecca Isaacs
• Ratul Mahajan
• Yin Zhang
• Aditya Akella
• David Kotz
• Charles DiFatta
• …

Page 3: “Creating Data Repositories..”

Motivation

• Network management research:
  – Barrier to entry is high
  – Data/insights from operators/industry are critical
• Examples:
  – Failure characterization of enterprise networks
  – VLAN characterization and use
  – Configuration management

Page 4: “Creating Data Repositories..”

What happens today..?

• End-user centric measurement studies
  – Network is a “black box”: no operator involvement
  – Real need: “white box”
• Campus networks
  – Difficulties in bootstrapping relationships with operators
• Enterprise/operator networks
  – Sprint or AT&T (Microsoft with end-user)
  – Limited pool of researchers
• Data across multiple enterprises?
• Trends over many years?

Page 5: “Creating Data Repositories..”

Bottom line

• Need a data repository
  – Contributors from operators, researchers, industry
  – Accessible to all researchers
• Facilitate research much like PlanetLab
• Vital to have a “critical mass” of researchers on network management
  – Research along high-impact real problems

Page 6: “Creating Data Repositories..”

Data Sharing: what inhibits it?

• Sensitivity of data
  – Security issues (firewall policies, network structure)
  – Privacy issues (records of individual activity)
• Proprietary nature of data
  – E.g. how many calls were received, mobility models
  – Possible to have others use it?
• “Secret weapon” for research
  – Competition vs. collaboration
• Inertia / too much effort

Page 7: “Creating Data Repositories..”

Solutions

• Carrots/sticks to promote data sharing
  – “Must release data” to publish
  – IMC: best paper award only to work releasing data
• Technical ways to address concerns with sharing

Page 8: “Creating Data Repositories..”

Positive Example

• Example: HSARPA “PREDICT”: make research on network security possible
  – Firewall and IDS network security data

Page 9: “Creating Data Repositories..”

Research: Anonymization

• Hiding provider, hiding individual information
• Need framework to reason about it
  – What trade-offs do you make?
  – What risks are posed?
  – How to expose trade-offs in a way we can appreciate?
• Anonymization is very domain specific
  – E.g. configuration file vs. packet trace
  – Are there common themes?
• Other models:
  – NDA-based
  – “Give me a question” -> “return answer”
  – “Exploratory” nature of research
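
One possible "common theme" across configuration files and packet traces is that both contain IP addresses that need a consistent replacement. The sketch below is only an illustration of that idea, not a method proposed in the slides: it assumes a keyed hash (HMAC-SHA256) with a hypothetical secret key held by the data owner, and maps every IPv4 address to a stable pseudonym in 10.0.0.0/8. The function names, the key, and the sample lines are all made up for the example.

```python
import hashlib
import hmac
import ipaddress
import re

# Hypothetical secret held only by the data owner; never released with the data.
SECRET_KEY = b"example-key-not-for-real-use"

# Candidate dotted-quad strings; validity is checked when pseudonymizing.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(addr, key=SECRET_KEY):
    """Map an IPv4 address to a stable pseudonym inside 10.0.0.0/8."""
    packed = ipaddress.IPv4Address(addr).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    host_bits = int.from_bytes(digest[:3], "big")  # keep 24 bits of the digest
    return str(ipaddress.IPv4Address((10 << 24) | host_bits))


def anonymize_line(line, key=SECRET_KEY):
    """Replace every valid IPv4 address in a line of text with its pseudonym."""
    def replace(match):
        try:
            return pseudonymize_ip(match.group(0), key)
        except ipaddress.AddressValueError:
            return match.group(0)  # not a valid address, leave untouched
    return IPV4_RE.sub(replace, line)


if __name__ == "__main__":
    # Made-up sample inputs: one configuration line, one packet-trace record.
    config_line = "access-list 101 permit ip host 192.0.2.7 any"
    trace_record = "1190000000.123 192.0.2.7 > 198.51.100.9 TCP 443"
    print(anonymize_line(config_line))
    print(anonymize_line(trace_record))
```

The trade-offs are visible even in this small sketch: the same host gets the same pseudonym in both data types, so cross-referencing still works, but subnet structure is destroyed, two addresses can collide in the 24-bit space, and anyone who obtains the key can test guesses against the pseudonyms.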

Page 10: “Creating Data Repositories..”

Community effort: Cooperate on IRB

• Social sciences:
  – Lots of experience with IRB
• Networking:
  – Lack of clear guidelines on IRB process
  – Admins feel happier if IRB can “sanction” things
• As a community:
  – Must appreciate need/process for IRB
  – Develop guidelines for IRB process
  – Share IRB documents

Page 11: “Creating Data Repositories..”

Creating shareable data

• 75% of time spent figuring out how to use data
• Researcher needs vary
  – Different forms of data
  – Historical vs. streaming
    • Dated? Trending?
  – Assumptions made / gaps in data
  – “Timing info crucial at sub-RTT level”?
• Sharing is hard, many idiosyncrasies
  – Data collection infrastructure, annotation

Page 12: “Creating Data Repositories..”

User Diagnostics

• One-on-one: exact data provided
• Create shared repository(ies)
  – What data do most users want?
  – Is that 20% of stuff most critical to provide?
• Data collection tools
• Metadata part of problem
  – Create data in standard formats
  – “Observatory”:
    • How to discover, describe, explain data
    • Access policy, use policy
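
To make the metadata point concrete, here is a minimal sketch of what a catalog entry for a shared data set might record so that others can discover, describe, and explain it. The slides do not define a schema; every field name, the `discover` helper, and the sample entries are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry describing one shared data set."""
    name: str
    description: str
    data_format: str          # e.g. "pcap", "router config", "syslog"
    collected_from: str       # ISO date, start of collection
    collected_to: str         # ISO date, end of collection
    access_policy: str        # who may obtain the data
    use_policy: str           # what holders may do with it
    known_gaps: List[str] = field(default_factory=list)  # assumptions, missing periods


def discover(catalog, keyword):
    """Return records whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        record for record in catalog
        if needle in record.name.lower() or needle in record.description.lower()
    ]


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    catalog = [
        DatasetRecord(
            name="campus-vlan-configs",
            description="Anonymized switch configurations with VLAN definitions",
            data_format="router config",
            collected_from="2007-01-01",
            collected_to="2007-06-30",
            access_policy="registered researchers",
            use_policy="no re-identification attempts",
            known_gaps=["timestamps rounded to the day"],
        ),
        DatasetRecord(
            name="enterprise-failure-log",
            description="Link failure events from an enterprise network",
            data_format="syslog",
            collected_from="2006-09-01",
            collected_to="2007-08-31",
            access_policy="NDA required",
            use_policy="aggregate results only",
        ),
    ]
    for record in discover(catalog, "vlan"):
        print(record.name, record.access_policy)
```

Recording gaps and policies next to the data is one way to cut down the time researchers spend working out how a data set can be used.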

Page 13: “Creating Data Repositories..”

Other

• Streaming data: online vs. offline
• Scalable collection:
  – What to collect? Over how long?
  – Compression techniques
  – Fine-grained: overhead; coarse-grained: information loss
• What does it take to build this infrastructure?
  – Get all types of data as painlessly as possible
  – Massage, orchestrate data to fit researcher needs
  – Simple APIs to get data out; fast analysis tools
  – Federated access
  – Data management: lifecycle of data
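
The fine-grained versus coarse-grained trade-off can be illustrated with a small sketch. It assumes, purely for the example, that the collected data is a list of per-interval counters and that coarsening means averaging over fixed-size bins; the `coarsen` function and the sample values are hypothetical.

```python
def coarsen(samples, bin_size):
    """Average consecutive samples in bins of bin_size (last bin may be shorter)."""
    bins = []
    for start in range(0, len(samples), bin_size):
        chunk = samples[start:start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


if __name__ == "__main__":
    # Made-up per-second byte counts with one short burst.
    per_second = [0, 0, 0, 100, 0, 0, 0, 0]
    for bin_size in (1, 2, 8):
        binned = coarsen(per_second, bin_size)
        # Fewer stored values means less overhead, but the burst is flattened.
        print(bin_size, len(binned), max(binned))
```

With one-second bins all eight values are stored and the burst of 100 is visible; with a single eight-second bin only one value is stored and the peak shrinks to 12.5, which is the information loss the slide refers to.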

Page 14: “Creating Data Repositories..”

Action Items

• Community-wide efforts:
  – Initiate efforts to create data repository
    • How to manage? Who contributes? Who arbitrates?
    • How much storage? Lifecycle: how long to store data?
  – Create IRB guidelines for networking data
• Research:
  – Anonymization
  – Usage diagnostics -> what to collect, release: widely applicable
  – Data collection tools, metadata information
• Industry, operators must be as actively involved as possible