ECEN 689 Special Topics in Data Science for Communications ...cesg.tamu.edu/wp-content/uploads/2014/09/ECEN689-lec1.pdf · Data Science in Communications Networks • Internet Service

Nick Duffield
Department of Electrical & Computer Engineering
Texas A&M University

ECEN 689 Special Topics in Data Science for Communications Networks

Lecture 1: Communications Networks and Measurements

Organization

•  Instructor: Nick Duffield
•  Contact: duffieldng AT tamu DOT edu ; (979) 845-7328
•  Class notes: http://cesg.tamu.edu/?p=2667
•  Class times: Mon/Wed 03:00-04:15pm, FELD 111
•  Office hours: WEB 332D, Mon/Wed 11:00am-12:00pm
•  Prerequisites: graduate standing; instructor approval; working background in probability, statistics
•  Grading:
    –  Homework 50%; Project 15%; Presentation 15%; Final Exam 20%
    –  Discussion of homework assignments is encouraged, but homework must be executed independently and copying is not allowed.
    –  Assignments must be typeset and handed in on time to receive full credit.

Course Materials: All available online

•  Background references
    –  Mitzenmacher and Upfal: Probability and Computing. http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048/9781139637152
    –  Peterson & Davie: Computer Networks (5th Edition). http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048//9780123850591
•  Detailed references: selections from
    –  Leskovec, Rajaraman & Ullman: Mining of Massive Data Sets. http://www.mmds.org
    –  Kolaczyk: Statistical Analysis of Network Data: Methods and Models. http://link.springer.com.lib-ezproxy.tamu.edu:2048/book/10.1007%2F978-0-387-88146-1
•  Review articles and tutorials
    –  Duffield: Sampling for Passive Internet Measurement: A Review. http://projecteuclid.org/euclid.ss/1110999311
    –  Cormode, Duffield: Sampling for Big Data. http://nickduffield.net/download/papers/Tutorial_KDD_2014.ppsx
•  Research literature references:
    –  Will be communicated in class notes

Objectives of the course

•  Broad description:
    –  Statistical and algorithmic methods for acquiring and analysing massive, complex, and incomplete datasets.
    –  Applications to measurement and analysis of operational data in ISP communication networks, routers and protocols.
    –  Understanding of design decisions and trade-offs between statistical and computational goals
•  Topics on the course
    –  Sampling, sketching, network probing, network tomography, graph sampling.
    –  Relevant background in probability, statistics, and networking recapped as needed, with references for further reading
•  Topics NOT on this course
    –  Machine learning
    –  Hadoop, MapReduce

About me

•  Joined TAMU August 2014 from Rutgers University
•  Worked for 18 years in AT&T Labs Research in New Jersey
•  Previously Asst. Professor in Europe
•  Undergrad/PhD in Physics and Mathematical Physics
•  Research Interests
    –  Streaming algorithms
    –  Network Measurement
    –  Big Data Analytics
•  Methods: statistics, algorithms, machine learning
•  Applications: transportation, healthcare, engineering in general

Communications Networks and Measurement

Data Science and Big Data

•  Big Data arises in many forms:
    –  Physical Measurements: from science (physics, astronomy)
    –  Medical data: genetic sequences, detailed time series
    –  Activity data: GPS location, social network activity
    –  Business data: customer behavior tracking at fine detail
•  Why is “Big Data” trending up?
    –  Availability of data in new fields
    –  Technological advances
        •  Hardware
        •  Computation
        •  Algorithms
    –  Anticipated value of analysis

Data Science in Communications Networks

•  Motivating application: Internet Service Providers (ISPs)
•  Many reasons to study data science from ISP viewpoint
    –  Expertise: instructor’s experience from ISP world
    –  Demand: data science methods developed in response to ISP needs
    –  Practice: methods widely used in ISP monitoring, built into routers
    –  Prescience: ISPs were first to hit many “big data” problems
    –  Variety: many different places where data science is needed

Data Science Disciplines

•  Transferable Methods
    –  Algorithms and Data Structures
    –  Probability and Statistics
    –  Inference and Machine Learning
•  Application domain
    –  This course: communications networking

Data Science in Communications Networks

•  Internet Service Providers had big data before “Big Data”
    –  Operational metadata concerning network usage and state
        1.  Telephony call detail records
            –  Originating and receiving telephone number, duration, …
        2.  IP traffic flow records generated by routers
            –  Source and destination IP address of packet flows, #packets, #bytes, …
        3.  Protocol transitions
            –  Handovers of mobile device between wireless basestations
•  Generated continuously, 100s of Terabytes per day
•  Many other operational datasets
•  Used in network management over a range of timescales
    –  From months (network planning) to seconds (network attack detection)

Structure of Large ISP Networks

[Figure: structure of a large ISP network, with the following components]
•  Peering with other ISPs
•  Access Networks: Wireless, DSL, IPTV
•  City-level Router Centers
•  Backbone Links
•  Downstream ISP and business customers
•  Service and Datacenters
•  Network Management & Administration

Measuring the ISP Network: Data Sources

[Figure sequence: ISP network diagram (Peering, Access, Router Centers, Backbone, Business, Datacenters, Management), annotated in turn with each data source below]
•  One-way Packet Loss & Latency: active probing between measurement devices
•  Roundtrip Packet Loss & Latency: monitoring both directions of traffic between two hosts
•  Status Reports: device failures and transitions
•  Customer Care Logs
•  Protocol Monitoring: e.g. wireless handovers (basestations A, B, C, D; active sets (A,B) and (C,D))
•  Link Traffic Rates: timeseries of traffic per router interface, 5 minute granularity
•  IP Traffic Flow Records: generated by routers

Three challenges for ISP data analysis

•  Scale: some datasets are enormous
    –  IP Traffic Flow Records, Mobile device handovers, …
•  Incompleteness:
    –  Not all quantities can be directly measured
        •  Would like to know packet loss and latency per link
        •  Typically only measure these on a path comprising multiple links
•  Complexity
    –  Complex statistical properties difficult to model
        •  Noisy data, skewed distributions, 80-20 laws, correlations
•  The methods in this course tackle these challenges


1. Traffic Flow Measurement

•  IP Protocol layers & packet headers
•  Router based traffic measurement
•  Measurement design decisions
•  Traffic flows, NetFlow

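Protocol layers in the Internet

[Figure: packet encapsulation. The application packet is the payload behind the transport header; that transport packet is the payload behind the IP header; that IP packet is the payload behind the link header, forming the network packet.]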

IP packet header

•  IP version 4 (IPv4). Main focus here:
    –  32 bit Source IP address (SrcIP)
    –  32 bit Destination IP address (DstIP)
    –  Usually written in dot decimal notation, e.g., 128.194.121.31
•  Also: IP Protocol (Proto) signifies which IP protocol is used in the remainder of the packet
•  Routers: use DstIP for packet forwarding
    –  Determine router egress interface for the packet
•  How?
    –  Routers can’t store (DstIP, egress) for each possible DstIP (2^32 ≈ 4G)

[Figure: IPv4 header layout, 32 bits wide (bit positions 0-15 and 16-31)]

IP Prefixes

•  Prefix = first m bits of IP address for some m ≤ 32
•  Represents a block of addresses
    –  First m bits in common; remaining 32 – m bits take any value
•  CIDR notation for address block
    –  dot_decimal_address / prefix length, e.g. 192.168.100.0 / 22
    –  Comprises 2^(32-22) = 2^10 = 1024 addresses from 192.168.100.0 to 192.168.103.255
•  In binary notation (first 22 bits common)
    –  192.168.100.0   = 11000000.10101000.01100100.00000000
    –  192.168.103.255 = 11000000.10101000.01100111.11111111
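The CIDR arithmetic for the example block can be checked with a short sketch using Python's standard `ipaddress` module. The helper `to_binary` is a hypothetical name introduced here for illustration only.

```python
import ipaddress

# The address block from the slide, in CIDR notation.
block = ipaddress.ip_network("192.168.100.0/22")

num_addresses = block.num_addresses   # 2**(32 - 22)
first_address = str(block[0])         # remaining 10 bits all 0
last_address = str(block[-1])         # remaining 10 bits all 1


def to_binary(dotted: str) -> str:
    """Render a dot-decimal IPv4 address as four 8-bit binary groups."""
    return ".".join(f"{int(octet):08b}" for octet in dotted.split("."))


print(num_addresses, first_address, last_address)
print(to_binary(first_address))
print(to_binary(last_address))
```

The two binary strings agree in their first 22 bits and differ only in the final 10, which is what the /22 prefix length expresses.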

IP Routing and Prefixes

•  Routers maintain a routing table
    –  Routing table = list of (DstIP_Prefix, egress) pairs; currently ~500k
    –  How? Routers communicate by protocols to announce and update tables
•  Forwarding Packets
    –  Find the (DstIP_Prefix, egress) entry in the table with the longest prefix that matches the packet’s DstIP
    –  Forward packet to egress interface
•  More detail: Peterson & Davie, Chap 3.2 & 3.3
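Longest-prefix matching can be illustrated with a minimal sketch. The three table entries and the interface names (`if-default`, `if-1`, `if-2`) are made up for this example, and the linear scan is for clarity only; production routers use specialised lookup structures rather than a list scan.

```python
import ipaddress

# Hypothetical routing table: (DstIP_Prefix, egress) pairs.
ROUTING_TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "if-default"),
    (ipaddress.ip_network("192.168.0.0/16"), "if-1"),
    (ipaddress.ip_network("192.168.100.0/22"), "if-2"),
]


def lookup_egress(dst_ip: str) -> str:
    """Return the egress of the longest prefix that contains dst_ip."""
    address = ipaddress.ip_address(dst_ip)
    matches = [(prefix.prefixlen, egress)
               for prefix, egress in ROUTING_TABLE
               if address in prefix]
    # The default route 0.0.0.0/0 matches every address, so matches is non-empty.
    return max(matches)[1]


print(lookup_egress("192.168.101.7"))   # matches /0, /16 and /22; /22 is longest
print(lookup_egress("192.168.1.1"))     # matches /0 and /16
print(lookup_egress("128.194.121.31"))  # matches only the default route
```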

IP Header and Information for ISPs

•  Have seen that IP header information is used to forward packets in routers in the ISP infrastructure
•  How could an ISP use this information for network management if it could be monitored, recorded and analysed?
•  Two example uses:
    –  Network planning: identify potential new customers based on volumes of traffic to or from their IP addresses
    –  Attack detection: detect an anomalous burst of traffic destined to a customer
•  Many other ISP network management tasks use IP header information over a range of timescales: from months to seconds


Transport and Other Protocols

•  Most data transmission accomplished by one of two IP transport protocols that provide the appearance of a communications channel between hosts
•  TCP: Transmission Control Protocol (Proto = 6)
    –  connection oriented protocol providing
        •  three-way handshake to set up connection
        •  reliable ordered transmission
        •  congestion-avoidance
•  UDP: User Datagram Protocol (Proto = 17)
    –  connectionless, no reliability, no congestion avoidance
•  Other (non-transmission) IP protocols in common use
    –  ICMP: Internet Control Message Protocol (Proto = 1)
        •  used to communicate error conditions; leveraged for probing & debugging

Transport Layer Header

•  16 bit source port (SrcPrt) and destination port (DstPrt)
•  Used in both TCP and UDP
•  Associate packets with applications at hosts (see: binding)
    –  Ports 0-1023: well known, assigned by IANA (mostly)
        •  E.g. HTTP server listens on port 80; DNS uses port 53
    –  Ports 1024-49151: registered ports
        •  E.g. Minecraft 19132
    –  Ports 49152-65535: dynamic ports
•  More detail: Peterson & Davie, Chap. 5.1, 5.2

UDP Header

    bits 0-15        bits 16-31
    Source port      Destination port
    UDP Length       UDP checksum
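As an illustration of the UDP header layout and the port ranges above, the following sketch unpacks the four 16-bit header fields with Python's standard `struct` module. The function names `parse_udp_header` and `port_class`, and the example segment, are assumptions introduced for this example.

```python
import struct

# Four 16-bit fields in network (big-endian) byte order, 8 bytes in total.
UDP_HEADER = struct.Struct("!HHHH")


def parse_udp_header(segment: bytes) -> dict:
    """Unpack the first 8 bytes of a UDP segment into named fields."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(segment)
    return {"SrcPrt": src_port, "DstPrt": dst_port,
            "length": length, "checksum": checksum}


def port_class(port: int) -> str:
    """Classify a port number using the three ranges listed above."""
    if port <= 1023:
        return "well known"
    if port <= 49151:
        return "registered"
    return "dynamic"


# Made-up segment: 4-byte payload from port 50000 to port 53.
# UDP Length counts the 8-byte header plus the payload.
example = UDP_HEADER.pack(50000, 53, 12, 0) + b"test"
header = parse_udp_header(example)
print(header, port_class(header["SrcPrt"]), port_class(header["DstPrt"]))
```

The TCP header begins with the same two 16-bit port fields, so the port classification applies to both protocols.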

TCP/UDP Header and Information for ISPs

•  Have seen that UDP/TCP header information (port numbers) is used at hosts to associate packets to applications
•  Many of these associations are registered by IANA
•  The identity of the application that generated a packet can be inferred (to some degree) from transport header port numbers
•  How could an ISP use this info for network management?
•  Two example uses:
    –  Network planning: detecting growth of new applications
        •  E.g. various P2P applications, but some ports dynamic or unofficial
    –  Attack detection: e.g. signature of exploit of application vulnerability
        •  E.g. Slammer Worm: UDP port 1434 (MS SQL Server), 376 byte packets

Measuring Network Traffic

•  ISPs: useful to record packet SrcIP, DstIP, SrcPrt, DstPrt
•  How can routers do this?
•  Finest conceivable granularity?
    –  Routers record (SrcIP,…) for each packet, export result to a collector
    –  Constraints: router cycles, network bandwidth for collection
        •  Possible with special purpose measurement devices for limited time
•  Coarse time granularity?
    –  Maintain counters of packets/bytes for each (SrcIP,…) seen
    –  Report at fixed time interval (e.g. every hour), then reset counters to 0.
    –  Constraints:
        •  storage: how many distinct combinations (SrcIP,…) seen in each interval?
        •  staleness: information may lose usefulness with reporting delay
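The coarse-granularity option can be sketched as follows, under simplifying assumptions: timestamps are in seconds, packets arrive in time order, and "export" is modelled by appending to a list. The class name `IntervalCounters` and its fields are hypothetical.

```python
from collections import Counter


class IntervalCounters:
    """Per-key packet and byte counters, reported and reset at fixed intervals."""

    def __init__(self, interval_seconds: float):
        self.interval_seconds = interval_seconds
        self.interval_start = None
        self.packets = Counter()
        self.bytes = Counter()
        self.reports = []  # stands in for export to a collector

    def observe(self, key, size_bytes: int, timestamp: float) -> None:
        if self.interval_start is None:
            self.interval_start = timestamp
        # Close out every interval that ended before this packet arrived.
        while timestamp >= self.interval_start + self.interval_seconds:
            self._report()
            self.interval_start += self.interval_seconds
        self.packets[key] += 1
        self.bytes[key] += size_bytes

    def _report(self) -> None:
        self.reports.append(
            (self.interval_start, dict(self.packets), dict(self.bytes)))
        # Reset counters to 0 for the next interval.
        self.packets.clear()
        self.bytes.clear()


counters = IntervalCounters(interval_seconds=3600)
counters.observe(("10.0.0.1", "10.0.0.2"), 100, timestamp=0)
counters.observe(("10.0.0.1", "10.0.0.2"), 200, timestamp=10)
counters.observe(("10.0.0.3", "10.0.0.4"), 50, timestamp=3700)
print(counters.reports)
```

The storage constraint shows up as the number of keys held in `packets` and `bytes` during an interval; the staleness constraint shows up as the delay of up to one interval before a packet's contribution reaches `reports`.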

Network Traffic Flows

•  Better:
    –  Exploit inherent timescale of packets generated by a user application
        •  Intuition: packets group into “sessions” e.g. web download, VOIP call, …
•  Abstractly, define an IP Flow:
    –  Set of packets with a shared property observed over some time period
•  Shared property is called the key
    –  Typically a tuple of fields from the IP and transport headers
    –  No unique definition of key; depends on purpose
•  5-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto)
    –  Application-to-application flow
•  2-tuple key: (SrcIP, DstIP)
    –  Host-to-host flow
•  7-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto, ToS, Ingress Intf)
    –  Used for flow measurement in some routers
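To show how the choice of key changes what counts as a flow, here is a small sketch that groups the same packets under the 5-tuple and 2-tuple keys. The `Packet` type, the helper names, and the three example packets are assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    size_bytes: int


def five_tuple_key(p: Packet) -> tuple:
    """Application-to-application key."""
    return (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)


def two_tuple_key(p: Packet) -> tuple:
    """Host-to-host key."""
    return (p.src_ip, p.dst_ip)


def bytes_per_flow(packets, key_fn) -> Counter:
    """Total bytes per flow, where key_fn defines the flow key."""
    totals = Counter()
    for p in packets:
        totals[key_fn(p)] += p.size_bytes
    return totals


# Made-up traffic: one TCP exchange and one UDP exchange between two hosts.
packets = [
    Packet("10.0.0.1", "10.0.0.2", 50001, 80, 6, 1500),
    Packet("10.0.0.1", "10.0.0.2", 50001, 80, 6, 500),
    Packet("10.0.0.1", "10.0.0.2", 50002, 53, 17, 80),
]
app_flows = bytes_per_flow(packets, five_tuple_key)
host_flows = bytes_per_flow(packets, two_tuple_key)
print(len(app_flows), len(host_flows))
```

Under the 5-tuple key the packets fall into two flows, while the 2-tuple key merges them into a single host-to-host flow.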

Flow measurement in routers

•  Routers maintain statistics on flows in a flow table
    –  Each flow: key, #packets, #bytes, first & last packet times, …
•  Each packet
    –  If no entry for packet key in flow table, instantiate new entry
        #packets(key) = #bytes(key) = 0; first_packet_time(key) = timestamp, …
    –  Update flow entry
        #packets(key)++ ; #bytes(key) += bytes; last_packet_time(key) = timestamp, …

[Figure: packets with keys 1 to 4 arriving over time. Each packet (key, bytes, timestamp, …) is mapped by hash(key) to its entry in the flow table, which holds (key1, stats1), (key2, stats2), (key3, stats3), (key4, stats4).]
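The per-packet update described above might be sketched as below. A Python dict plays the role of the hash-indexed flow table; the `FlowStats` type and `update_flow_table` function are hypothetical names, and a router implementation would differ in many respects (fixed memory, hardware hashing).

```python
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass
class FlowStats:
    packets: int
    bytes: int
    first_packet_time: float
    last_packet_time: float


def update_flow_table(table: Dict[Hashable, FlowStats], key: Hashable,
                      size_bytes: int, timestamp: float) -> None:
    """Apply one packet (key, bytes, timestamp) to the flow table."""
    stats = table.get(key)
    if stats is None:
        # No entry for this key: instantiate one with zeroed counters.
        stats = FlowStats(0, 0, timestamp, timestamp)
        table[key] = stats
    # Update the flow entry.
    stats.packets += 1
    stats.bytes += size_bytes
    stats.last_packet_time = timestamp


flow_table: Dict[Hashable, FlowStats] = {}
for key, size, ts in [("key1", 100, 0.0), ("key2", 40, 0.5), ("key1", 1500, 1.0)]:
    update_flow_table(flow_table, key, size, ts)
print(flow_table)
```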

Flow termination

•  No precise definition behind intuition of flow as a “session”
•  Routers use several criteria to terminate flows (configurable)
    –  Protocol signals: packet’s TCP FIN flag is set, ending TCP connection
    –  Inactive timeout: time since last observed flow packet > T_inactive
    –  Active timeout: time since first observed flow packet > T_active
    –  Flow table occupancy: terminate some flows if table occupancy > p%
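The two timeout criteria can be sketched as a periodic scan of the flow table. This covers only the inactive and active timeouts, not protocol signals or table occupancy. The table layout, function names, and timeout values are assumptions chosen for the example, not router defaults.

```python
from typing import Dict, List, Tuple

# Hypothetical flow table: key -> (first_packet_time, last_packet_time).
FlowTimes = Dict[str, Tuple[float, float]]


def flows_to_terminate(table: FlowTimes, now: float,
                       t_inactive: float, t_active: float) -> List[str]:
    """Return keys of flows that exceed the inactive or active timeout."""
    expired = []
    for key, (first_seen, last_seen) in table.items():
        inactive = now - last_seen > t_inactive   # idle for too long
        too_long = now - first_seen > t_active    # open for too long
        if inactive or too_long:
            expired.append(key)
    return expired


def export_and_release(table: FlowTimes, keys: List[str]) -> list:
    """Remove terminated flows from the table and return their records."""
    return [(key, table.pop(key)) for key in keys]


table = {"a": (0.0, 100.0), "b": (0.0, 10.0), "c": (90.0, 99.0)}
expired = flows_to_terminate(table, now=100.0, t_inactive=15.0, t_active=60.0)
records = export_and_release(table, expired)
print(expired, sorted(table))
```

Here flow "a" is still receiving packets but has been open longer than the active timeout, flow "b" has been idle longer than the inactive timeout, and flow "c" stays in the table.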

Flow records: realization & collection

•  Statistics of terminated flow exported in flow record
    –  release flow table memory for new flow statistics
•  Realization
    –  Cisco NetFlow dominates
        •  Current version 9; flow definition, export format highly configurable
        •  Most other router vendors offer (some version of) NetFlow
    –  Embodied in Internet Engineering Task Force Standards
        •  IP Flow Information eXport Working Group
•  Flow record collectors
    –  Network management software vendors offer collector/analysers
    –  Some public domain tools, e.g. cflowd
•  Related approaches
    –  e.g. sFlow
•  The future
    –  Dynamically configurable measurement in software defined networking

Background Reading

NetFlow and IETF Standards
•  Cisco NetFlow White Paper:
    –  http://tinyurl.com/cisco-netflow-whitepaper
•  IETF IPFIX Working Group
    –  WG Charter: http://datatracker.ietf.org/wg/ipfix/charter/
    –  Applying IPFIX Tutorial: http://www.ietf.org/edu/tutorials/ipfix-tutorial.pdf
