Nick Duffield
Department of Electrical & Computer Engineering
Texas A&M University
ECEN 689 Special Topics in Data Science for Communications Networks
Lecture 1: Communications Networks and Measurements
Organization
• Instructor: Nick Duffield
• Contact: duffieldng AT tamu DOT edu; (979) 845-7328
• Class notes: http://cesg.tamu.edu/?p=2667
• Class times: Mon/Wed 03:00-04:15pm, FELD 111
• Office hours: WEB 332D, Mon/Wed 11:00am-12:00pm
• Prerequisites: graduate standing; instructor approval; working background in probability, statistics
• Grading:
  – Homework 50%; Project 15%; Presentation 15%; Final Exam 20%
  – Discussion of homework assignments is encouraged, but homework must be executed independently and copying is not allowed.
  – Assignments must be typeset and handed in on time to receive full credit.
Course Materials: All available online
• Background references
  – Mitzenmacher and Upfal: Probability and Computing. http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048/9781139637152
  – Peterson & Davie: Computer Networks (5th Edition). http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048//9780123850591
• Detailed references: selections from
  – Leskovec, Rajaraman & Ullman: Mining of Massive Data Sets. http://www.mmds.org
  – Kolaczyk: Statistical Analysis of Network Data: Methods and Models. http://link.springer.com.lib-ezproxy.tamu.edu:2048/book/10.1007%2F978-0-387-88146-1
• Review articles and tutorials
  – Duffield: Sampling for Passive Internet Measurement: A Review
    • http://projecteuclid.org/euclid.ss/1110999311
  – Cormode, Duffield: Sampling for Big Data
    • http://nickduffield.net/download/papers/Tutorial_KDD_2014.ppsx
• Research literature references:
  – Will be communicated in class notes
Objectives of the course
• Broad description:
  – Statistical and algorithmic methods for acquiring and analysing massive, complex, and incomplete datasets.
  – Applications to measurement and analysis of operational data in ISP communication networks, routers and protocols.
  – Understanding of design decisions and trade-offs between statistical and computational goals
• Topics on the course
  – Sampling, sketching, network probing, network tomography, graph sampling.
  – Relevant background in probability, statistics, and networking recapped as needed, with references for further reading
• Topics NOT on this course
  – Machine learning
  – Hadoop, MapReduce
About me
• Joined TAMU August 2014 from Rutgers University
• Worked for 18 years in AT&T Labs Research in New Jersey
• Previously Asst. Professor in Europe
• Undergrad/PhD in Physics and Mathematical Physics
• Research Interests
  – Streaming algorithms
  – Network Measurement
  – Big Data Analytics
    • Methods: statistics, algorithms, machine learning
    • Applications: transportation, healthcare, engineering in general
Communications Networks and Measurement
Data Science and Big Data
• Big Data arises in many forms:
  – Physical Measurements: from science (physics, astronomy)
  – Medical data: genetic sequences, detailed time series
  – Activity data: GPS location, social network activity
  – Business data: customer behavior tracking at fine detail
• Why is “Big Data” trending up?
  – Availability of data in new fields
  – Technological advances
    • Hardware
    • Computation
    • Algorithms
  – Anticipated value of analysis
Data Science in Communications Networks
• Motivating application: Internet Service Providers (ISPs)
• Many reasons to study data science from ISP viewpoint
  – Expertise: instructor’s experience from ISP world
  – Demand: data science methods developed in response to ISP needs
  – Practice: methods widely used in ISP monitoring, built into routers
  – Prescience: ISPs were first to hit many “big data” problems
  – Variety: many different places where data science is needed
Data Science Disciplines
• Transferable Methods
  – Algorithms and Data Structures
  – Probability and Statistics
  – Inference and Machine Learning
• Application domain
  – This course: communications networking
Data Science in Communications Networks
• Internet Service Providers had big data before “Big Data”
  – Operational metadata concerning network usage and state
    1. Telephony call detail records
       – Originating and receiving telephone number, duration, …
    2. IP traffic flow records generated by routers
       – Source and destination IP address of packet flows, #packets, #bytes, …
    3. Protocol transitions
       – Handovers of mobile device between wireless basestations
• Generated continuously, 100s of Terabytes per day
• Many other operational datasets
• Used in network management over a range of timescales
  – From months (network planning) to seconds (network attack detection)
Structure of Large ISP Networks
Peering with other ISPs
Access Networks: Wireless, DSL, IPTV
City-level Router Centers
Backbone Links
Downstream ISP and business customers
Service and Datacenters
Network Management & Administration
Measuring the ISP Network: Data Sources
• One-way Packet Loss & Latency: active probing between measurement devices
• Roundtrip Packet Loss & Latency: monitoring both directions of traffic between two hosts
• Status Reports: device failures and transitions
• Customer Care Logs
• Protocol Monitoring: e.g. wireless handovers
  [Figure: basestations A, B, C, D with active sets (A,B) and (C,D)]
• Link Traffic Rates: timeseries of traffic per router interface, 5 minute granularity
  [Figure: traffic timeseries; time axis from 0:00 to 0:35 in 5 minute steps]
• IP Traffic Flow Records: generated by routers
Three challenges for ISP data analysis
• Scale: some datasets are enormous
  – IP Traffic Flow Records, Mobile device handovers, …
• Incompleteness:
  – Not all quantities can be directly measured
    • Would like to know packet loss and latency per link
    • Typically only measure these on a path comprising multiple links
• Complexity
  – Complex statistical properties difficult to model
    • Noisy data, skewed distributions, 80-20 laws, correlations
• The methods in this course tackle these challenges
1. Traffic Flow Measurement
• IP Protocol layers & packet headers
• Router based traffic measurement
• Measurement design decisions
• Traffic flows, NetFlow
Protocol layers in the Internet
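[Figure: encapsulation of a network packet]
• Network packet = link header + payload
• Link payload = IP header + payload
• IP payload = transport header + payload
• Transport payload = application packet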
IP packet header
• Routers: use DstIP for packet forwarding
  – Determine router egress interface for the packet
• How?
  – Routers can’t store (DstIP, egress) for each possible DstIP (2^32 ≈ 4G)
• IP version 4 (IPv4). Main focus here:
  – 32 bit Source IP address (SrcIP)
  – 32 bit Destination IP address (DstIP)
  – Usually written in dot decimal notation, e.g., 128.194.121.31
  – Also: IP Protocol (Proto) signifies which IP protocol is used in the remainder of the packet
[Figure: IPv4 header layout, bit positions 0 to 31]
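The header fields named above can be read directly out of the first 20 bytes of an IPv4 packet. The following is a minimal sketch, assuming a raw packet that starts at the IP header; the function name parse_ipv4_header and the dictionary keys are illustrative choices, not part of the lecture.

```python
import socket
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Extract SrcIP, DstIP and Proto from the fixed 20-byte IPv4 header."""
    if len(packet) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    # Network byte order: version/IHL, ToS, total length, identification,
    # flags/fragment offset, TTL, protocol, checksum, source, destination.
    (ver_ihl, tos, total_len, _ident, _frag,
     _ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    if ver_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return {
        "SrcIP": socket.inet_ntoa(src),   # dot decimal notation
        "DstIP": socket.inet_ntoa(dst),
        "Proto": proto,                   # e.g. 6 = TCP, 17 = UDP, 1 = ICMP
        "ToS": tos,
        "length": total_len,
    }
```

A router does this in hardware; the sketch only illustrates where the fields sit in the header.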
IP Prefixes
• Prefix = first m bits of IP address for some m ≤ 32
• Represents a block of addresses
  – First m bits in common; remaining 32 - m bits take any value
• CIDR notation for address block
  – dot_decimal_address / prefix length, e.g. 192.168.100.0 / 22
  – Comprises 2^(32-22) = 2^10 addresses from 192.168.100.0 to 192.168.103.255
• In binary notation, the first 22 bits are common
  – 192.168.100.0 = 11000000.10101000.01100100.00000000
  – 192.168.103.255 = 11000000.10101000.01100111.11111111
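The CIDR arithmetic for the example block can be checked with Python's standard ipaddress module. This is a small sketch using the block from the slide; the helper name in_block is illustrative.

```python
import ipaddress

# The example block from the slide, in CIDR notation.
block = ipaddress.ip_network("192.168.100.0/22")

num_addresses = block.num_addresses   # 2**(32 - 22) addresses
first, last = block[0], block[-1]     # lowest and highest address in the block

def in_block(address: str) -> bool:
    """True if the address shares the block's first 22 bits."""
    return ipaddress.ip_address(address) in block
```

Here num_addresses equals 1024, and first and last are 192.168.100.0 and 192.168.103.255.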
IP Routing and Prefixes
• Routers maintain a routing table
  – Routing table = list of (DstIP_Prefix, egress) pairs; currently ~500k
  – How? Routers communicate by protocols to announce and update tables
• Forwarding Packets
  – Find longest prefix (DstIP_Prefix, egress) in table that matches packet
  – Forward packet to egress interface
• More detail: Peterson & Davie, Chap 3.2 & 3.3
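Longest prefix matching can be sketched as follows. This is a linear scan over a toy table, for illustration only; production routers use specialized data structures and hardware for this lookup. The table contents and interface names (if0, if1, if2) are made up.

```python
import ipaddress

def longest_prefix_match(routing_table, dst_ip):
    """Return the egress of the longest prefix that contains dst_ip, or None.

    routing_table: iterable of (DstIP_Prefix, egress) pairs, prefix in CIDR form.
    """
    dst = ipaddress.ip_address(dst_ip)
    best = None  # (network, egress) of the longest match found so far
    for prefix, egress in routing_table:
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, egress)
    return None if best is None else best[1]

# Hypothetical table: a default route plus two nested prefixes.
table = [
    ("0.0.0.0/0", "if0"),
    ("192.168.0.0/16", "if1"),
    ("192.168.100.0/22", "if2"),
]
```

An address such as 192.168.101.7 matches all three entries, and the /22 entry wins because it is the most specific.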
IP Header and Information for ISPs
• Have seen that IP header information is used to forward packets in routers in the ISP infrastructure
• How could an ISP use this information for network management if it could be monitored, recorded and analysed?
• Two example uses:
  – Network planning: identify potential new customers based on volumes of traffic to or from their IP addresses
  – Attack detection: detect an anomalous burst of traffic destined to a customer
• Many other ISP network management tasks use IP header information over a range of timescales: from months to seconds
Transport and Other Protocols
• Most data transmission is accomplished by one of two IP transport protocols that provide the appearance of a communications channel between hosts
• TCP: Transmission Control Protocol (Proto = 6)
  – connection oriented protocol providing
    • three-way handshake to set up connection
    • reliable ordered transmission
    • congestion avoidance
• UDP: User Datagram Protocol (Proto = 17)
  – connectionless, no reliability, no congestion avoidance
• Other (non-transmission) IP protocols in common use
  – ICMP: Internet Control Message Protocol (Proto = 1)
    • used to communicate error conditions; leveraged for probing & debugging
Transport Layer Header
• 16 bit source port (SrcPrt) and destination port (DstPrt)
• Used in both TCP and UDP
• Associate packets with applications at hosts (see: binding)
  – Ports 0-1023: well known, assigned by IANA (mostly)
    • E.g. HTTP server listens on port 80; DNS uses port 53
  – Ports 1024-49151: registered ports
    • E.g. Minecraft 19132
  – Ports 49152-65535: dynamic ports
• More detail: Peterson & Davie, Chap. 5.1, 5.2
UDP Header (bit positions 0 to 31)
  bits 0-15: Source port | bits 16-31: Destination port
  bits 0-15: UDP Length  | bits 16-31: UDP checksum
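The UDP header layout and the port ranges above can be sketched in a few lines. The function names parse_udp_header and port_range are illustrative, and the input is assumed to start at the UDP header.

```python
import struct

def parse_udp_header(segment: bytes) -> dict:
    """Unpack the 8-byte UDP header: four 16-bit fields in network byte order."""
    if len(segment) < 8:
        raise ValueError("need at least 8 bytes for a UDP header")
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", segment[:8])
    return {"SrcPrt": src_port, "DstPrt": dst_port,
            "length": length, "checksum": checksum}

def port_range(port: int) -> str:
    """Classify a port number into the three ranges listed on the slide."""
    if not 0 <= port <= 65535:
        raise ValueError("port must fit in 16 bits")
    if port <= 1023:
        return "well known"
    if port <= 49151:
        return "registered"
    return "dynamic"
```

For example, port_range(53) gives "well known" and port_range(19132) gives "registered".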
TCP/UDP Header and Information for ISPs
• Have seen that UDP/TCP header information (port numbers) is used at hosts to associate packets to applications
• Many of these associations are registered by IANA
• The identity of the application that generated a packet can be inferred (to some degree) from transport header port numbers
• How could an ISP use this info for network management?
• Two example uses:
  – Network planning: detecting growth of new applications
    • E.g. various P2P applications, but some ports dynamic or unofficial
  – Attack detection: e.g. signature of exploit of application vulnerability
    • E.g. Slammer Worm: UDP port 1434 (MS SQL Server), 376 byte packets
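As a rough illustration of the network planning use, traffic volume can be broken down by the application inferred from the destination port. This is only a sketch: the port-to-application map contains just the examples mentioned in these slides, the record format (dst_port, nbytes) is an assumption, and port-based inference is approximate, as noted above.

```python
from collections import Counter

# Illustrative map built from the ports mentioned in the slides only.
PORT_TO_APP = {80: "HTTP", 53: "DNS", 1434: "MS SQL Server"}

def bytes_by_application(records):
    """Sum bytes per inferred application.

    records: iterable of (dst_port, nbytes) pairs (assumed format).
    Ports not in the map are grouped under "unknown".
    """
    totals = Counter()
    for dst_port, nbytes in records:
        totals[PORT_TO_APP.get(dst_port, "unknown")] += nbytes
    return totals
```

Tracking these totals over successive reporting periods would show which applications are growing.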
Measuring Network Traffic
• ISPs: useful to record packet SrcIP, DstIP, SrcPrt, DstPrt
• How can routers do this?
• Finest conceivable granularity?
  – Routers record (SrcIP,…) for each packet, export result to a collector
  – Constraints: router cycles, network bandwidth for collection
    • Possible with special purpose measurement devices for limited time
• Coarse time granularity?
  – Maintain counters of packets/bytes for each (SrcIP,…) seen
  – Report at fixed time interval (e.g. every hour), then reset counters to 0.
  – Constraints:
    • storage: how many distinct combinations (SrcIP…) seen in each interval?
    • staleness: information may lose usefulness with reporting delay
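The coarse-granularity scheme can be sketched as a set of per-key counters that are reported and reset at fixed intervals. The class name IntervalCounters and its interface are assumptions made for illustration; timestamps are taken to be in seconds and non-decreasing.

```python
from collections import defaultdict

class IntervalCounters:
    """Per-key packet/byte counters, reported and reset every `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self.window_start = None                      # start time of current interval
        self.counters = defaultdict(lambda: [0, 0])   # key -> [packets, bytes]
        self.reports = []                             # (window_start, snapshot) pairs

    def observe(self, timestamp, key, nbytes):
        if self.window_start is None:
            self.window_start = timestamp
        # Close every interval that ended before this packet arrived
        # (intervals with no traffic produce empty reports).
        while timestamp >= self.window_start + self.interval:
            self._report()
        counter = self.counters[key]
        counter[0] += 1
        counter[1] += nbytes

    def _report(self):
        snapshot = {k: tuple(v) for k, v in self.counters.items()}
        self.reports.append((self.window_start, snapshot))
        self.counters.clear()                         # reset counters to 0
        self.window_start += self.interval
```

Both constraints are visible here: the counters dictionary grows with the number of distinct keys seen in an interval, and nothing is reported until the interval ends.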
Network Traffic Flows
• Better:
  – Exploit inherent timescale of packets generated by a user application
    • Intuition: packets group into “sessions” e.g. web download, VOIP call, …
• Abstractly, define an IP Flow:
  – Set of packets with a shared property observed over some time period
• Shared property is called the key
  – Typically a tuple of fields from the IP and transport headers
  – No unique definition of key; depends on purpose
• 5-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto)
  – Application-to-application flow
• 2-tuple key: (SrcIP, DstIP)
  – Host-to-host flow
• 7-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto, ToS, Ingress Intf)
  – Used for flow measurement in some routers
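The three key definitions can be written as functions of a packet's header fields. The Packet type below is a hypothetical container introduced for this sketch; only the field lists of the keys come from the slide.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical per-packet observation holding the fields used in flow keys."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    tos: int
    ingress_intf: int
    nbytes: int
    timestamp: float

def key_5tuple(p: Packet):
    """Application-to-application flow key."""
    return (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)

def key_2tuple(p: Packet):
    """Host-to-host flow key."""
    return (p.src_ip, p.dst_ip)

def key_7tuple(p: Packet):
    """5-tuple extended with ToS and ingress interface."""
    return key_5tuple(p) + (p.tos, p.ingress_intf)
```

Two packets between the same hosts but on different ports share a 2-tuple key and differ in their 5-tuple keys, so the choice of key sets the granularity of the resulting flows.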
Flow measurement in routers
• Routers maintain statistics on flows in a flow table
  – Each flow: key, #packets, #bytes, first & last packet times, …
• Each packet
  – If no entry for packet key in flow table, instantiate new entry
      #packets(key) = #bytes(key) = 0; first_packet_time(key) = timestamp, …
  – Update flow entry
      #packets(key)++ ; #bytes(key) += bytes; last_packet_time(key) = timestamp, …
[Figure: packets with keys key 1 to key 4 arriving over time; each packet (key, bytes, timestamp, …) is mapped by hash(key) to its flow table entry (key1 stats1, key2 stats2, key3 stats3, key4 stats4)]
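The per-packet update rule can be sketched with a Python dict standing in for the hash-indexed flow table. The names FlowStats and update_flow_table are illustrative; the fields follow the statistics listed on the slide.

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    packets: int = 0
    bytes: int = 0
    first_packet_time: float = 0.0
    last_packet_time: float = 0.0

def update_flow_table(flow_table: dict, key, nbytes: int, timestamp: float) -> FlowStats:
    """Apply one packet to the flow table (a dict, i.e. a hash table on the key)."""
    entry = flow_table.get(key)
    if entry is None:
        # No entry for this key: instantiate one with zeroed counters.
        entry = FlowStats(first_packet_time=timestamp)
        flow_table[key] = entry
    # Update the flow entry with this packet.
    entry.packets += 1
    entry.bytes += nbytes
    entry.last_packet_time = timestamp
    return entry
```

Each packet costs one hash lookup and a constant number of updates, which is what makes this feasible at router line rates when implemented in hardware.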
Flow termination
• No precise definition behind intuition of flow as a “session”
• Routers use several criteria to terminate flows (configurable)
  – Protocol signals: packet’s TCP FIN flag is set, ending TCP connection
  – Inactive timeout: time since last observed flow packet > T_inactive
  – Active timeout: time since first observed flow packet > T_active
  – Flow table occupancy: terminate some flows if table occupancy > p%
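The first three termination criteria can be sketched as a periodic sweep over the flow table. This is a simplified illustration: the per-flow dict with "first", "last" and optional "fin" entries is an assumed format, the occupancy criterion is left out, and timeout values are parameters rather than any router's defaults.

```python
def expire_flows(flow_table: dict, now: float, t_inactive: float, t_active: float):
    """Remove and return flows that meet a termination criterion.

    flow_table: key -> {"first": time of first packet, "last": time of last
    packet, "fin": True if a TCP FIN was seen (optional)}  (assumed format).
    """
    exported = []
    for key in list(flow_table):            # copy keys: we delete while iterating
        stats = flow_table[key]
        fin_seen = stats.get("fin", False)              # protocol signal
        inactive = now - stats["last"] > t_inactive     # inactive timeout
        too_long = now - stats["first"] > t_active      # active timeout
        if fin_seen or inactive or too_long:
            exported.append((key, flow_table.pop(key)))  # frees the table entry
    return exported
```

The active timeout means a long-lived session may be reported as several flow records, so a measured flow does not always correspond to one complete session.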
Flow records: realization & collection
• Statistics of terminated flow exported in flow record
  – Release flow table memory for new flow statistics
• Realization
  – Cisco NetFlow dominates
    • Current version 9; flow definition, export format highly configurable
    • Most other router vendors offer (some version of) NetFlow
  – Embodied in Internet Engineering Task Force Standards
    • IP Flow Information eXport Working Group
• Flow record collectors
  – Network management software vendors offer collector/analysers
  – Some public domain tools, e.g. cflowd
• Related approaches
  – e.g. sFlow
• The future
  – Dynamically configurable measurement in software defined networking
Background Reading
NetFlow and IETF Standards
• Cisco NetFlow White Paper:
  – http://tinyurl.com/cisco-netflow-whitepaper
• IETF IPFIX Working Group
  – WG Charter: http://datatracker.ietf.org/wg/ipfix/charter/
  – Applying IPFIX Tutorial: http://www.ietf.org/edu/tutorials/ipfix-tutorial.pdf