Transcript
Page 1: What Data Do We Need and Why Do We Need It?

What Data Do We Need and Why Do We Need It?

Jim Pepin

Chief Technology Officer

University of Southern California

Page 2: What Data Do We Need and Why Do We Need It?

Network Data: Research Depends on It

Solutions depend on understanding the problem…

Advances in many areas depend on analysis of real data
• Network Management: Traffic engineering, net design
• Network Control: Improving routing protocols
• High Performance: Better transport protocols
• Security: Tracking/stopping DoS and worm attacks

Over 30% of papers in top networking conference (SIGCOMM’04) depended on data collected by others

Most common providers:
• ISPs (e.g., ATT, Sprint, I2)
• Service Providers (e.g., Akamai)
• Individual campuses (e.g., UNC, UOregon, USC – some campuses give data only to local researchers)

Page 3: What Data Do We Need and Why Do We Need It?

Network Data: More than Just Packet Traces

Some data more sensitive than others
• Dynamic routing information: routing protocol advertisements
• Static design information: Router configuration files, peering arrangements, policies
• Operational events: alarms, trouble tickets (very few sources of this important info!)
• Traffic logs: netflow records, packet header traces
• Application data: URLs, p2p filenames, DNS queries

Tension – how much correlation to permit?
• Data that can be correlated across multiple sites is most valuable in measuring network-wide events, e.g. worms
• Privacy techniques anonymize and blur identity
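The tension above can be made concrete with a small sketch. It is not taken from the slides and is not the method used by any provider named here. It assumes a keyed, prefix-preserving scrambling of IPv4 addresses: sites that share the same key produce the same scrambled address, so records can be correlated, while the real address is hidden from anyone without the key. The function name scramble_ipv4 and the key values are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def scramble_ipv4(addr: str, key: bytes) -> str:
    """Illustrative prefix-preserving scrambling of an IPv4 address.

    Each output bit is the input bit XORed with a keyed pseudorandom bit
    that depends only on the more significant input bits. Two addresses
    sharing their first k bits therefore share their first k scrambled bits.
    """
    bits = int(ipaddress.IPv4Address(addr))
    out = 0
    for i in range(32):
        # The i most significant bits of the original address (0 when i == 0).
        prefix = bits >> (32 - i) if i else 0
        # Keyed pseudorandom bit derived from the prefix length and value.
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        orig_bit = (bits >> (31 - i)) & 1
        out = (out << 1) | (orig_bit ^ flip)
    return str(ipaddress.IPv4Address(out))


if __name__ == "__main__":
    shared_key = b"hypothetical-key-shared-by-sites"
    # Same key, same address: identical output, so sites can correlate.
    print(scramble_ipv4("192.0.2.17", shared_key))
    # Same /24 as above: the scrambled address stays in one scrambled /24.
    print(scramble_ipv4("192.0.2.200", shared_key))
```

The design choice mirrors the trade-off in the bullets: a shared key permits cross-site correlation, while per-site keys blur identity further at the cost of that correlation.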

Page 4: What Data Do We Need and Why Do We Need It?

Example of Data Provider

DHS PREDICT
• DHS support for network research
• Not for operational use by DHS
• Major Players
• Peer review ground rules
• Generic sources for legitimate research

LANDER Project
• Example of PREDICT supplier
• Joint project of USC-ISI networking division and USC/ISD Center for High Performance Computing and Communications
– USC-HPCC is manager of WAN for USC/CIT/JPL.
– ISI provides networking research background
– HPCC provides data storage and computational resources
– We work together on ground rules and MOUs
– LANDER funds collection systems, support staff and disk/tape space

Page 5: What Data Do We Need and Why Do We Need It?

What is hard and easy

LANDER ground rules
• Scrambled headers is primary product today
• Requires MOU with researcher
• No collection of data payloads.
• Working on very strict MOU for very limited use of non-scrambled header data for very select uses in very controlled environment.
• Build collection management system integrated with other PREDICT sites.

How we do this
• Very close co-operation between ISI, ISD and university legal
• MOUs will be very clear and understandable for the researcher
• USC can reject any application
• USC will review any publication based on unscrambled headers and all work processing these headers will be done inside HPCC

Page 6: What Data Do We Need and Why Do We Need It?

Why would we do this

The Internet needs to be studied and engineered
• What is the modern equivalent of Bell Labs for phone system?
• How did we get to where we are today?
– Co-operation between researchers and operators.
• We can’t allow ourselves to have complete bunker mentality
• We need to be selective in what we provide, but in case of demonstrated need provide what is needed consistent with policies
• If we don’t do this no one will
• The risks can be managed if we take the time and effort to work with campus management (legal, CIOs etc) to mitigate
• Researchers can be brought into these discussions if cast correctly
• If we don’t study how the network works our ability to manage it will degrade to zero over time