Graph mining for log data
David Andrzejewski - @davidandrzej
Data Sciences, Sumo Logic
Strata – Hardcore Data Science Track
February 18, 2015
This talk: Graph Mining + Log Data
YES: logs, graph mining, application examples
NO: tools, scaling
Graphs!
Nodes
Edges
– undirected
– directed
Components
Paths/reachability
Subgraphs
Degree
Labels
Graph    Nodes     Edges
Social   People    Friendship
Web      Pages     Links
System   Services  API Calls

[Figure: graph with nodes labeled API, Auth, User, Org]
Anatomy of a log message: Five W’s
When? Timestamp with time zone
Where? Host, module, code location
Who? Authentication context
What? Log level and key-value pairs
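As a rough illustration of the questions listed above, the sketch below splits one log line into when / where / who / what. The line layout, the sample line, and the `parse_log_line` helper are all assumptions made up for this example; real log formats vary widely and the talk does not prescribe one.

```python
import re
from datetime import datetime

# Hypothetical log layout (an assumption, not a format from the talk):
# <ISO-8601 timestamp> <host> <module> [<user>] <LEVEL> <key=value pairs...>
LINE_RE = re.compile(
    r"^(?P<when>\S+) (?P<host>\S+) (?P<module>\S+) "
    r"\[(?P<who>[^\]]*)\] (?P<level>[A-Z]+) (?P<rest>.*)$"
)
KV_RE = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line):
    """Split one log line into when / where / who / what."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("line does not match the assumed layout")
    return {
        "when": datetime.fromisoformat(m.group("when")),  # keeps the time zone offset
        "where": {"host": m.group("host"), "module": m.group("module")},
        "who": m.group("who"),
        "what": {
            "level": m.group("level"),
            "fields": dict(KV_RE.findall(m.group("rest"))),
        },
    }


# Made-up sample line in the assumed layout.
example = "2015-02-18T10:15:30+00:00 web-01 checkout [user42] ERROR accountID=1234 requestID=082A"
parsed = parse_log_line(example)
```

The key-value pairs under "what" are the pieces that later link one log event to another.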
Context: Sumo Logic
“Turning Machine Data Into IT and Business Insights”
Interactions / connections in log data
Human – Machine
– behavior analysis
  • business intelligence
  • security
Machine – Machine
– API calls
  • ops / troubleshooting
Human – Human
– not usually logged...yet
Use case: troubleshooting
User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
ERROR accountID=1234 not found!
PROCESSING FAILED: webID=7F92
Connected components
Parse fields from each log event: $\ell_{i1}, \ell_{i2}, \ldots$
Build graph
– nodes = each log event $\ell_i$
– edges = do a pair of logs match on any field? $e_{ij} = \bigvee_k \{\ell_{ik} = \ell_{jk}\}$
Calculate undirected connected components: $O(n)$
Output: partition over $\ell_i$

Example log events:
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
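The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes fields appear as `key=value` tokens, and it uses a union-find structure (my choice) so that events sharing a field value are merged without comparing every pair. The `connected_log_groups` helper and the final "unrelated" log line are hypothetical.

```python
import re
from collections import defaultdict

KV_RE = re.compile(r"(\w+)=(\w+)")


def connected_log_groups(lines):
    """Group log lines that share any field=value pair, directly or transitively."""
    parent = list(range(len(lines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Rather than testing all pairs of events, link each event to the first
    # event seen with the same (field, value); the components are the same.
    first_seen = {}
    for i, line in enumerate(lines):
        for pair in KV_RE.findall(line):
            if pair in first_seen:
                union(i, first_seen[pair])
            else:
                first_seen[pair] = i

    groups = defaultdict(list)
    for i in range(len(lines)):
        groups[find(i)].append(i)
    return sorted(groups.values())


logs = [
    "User action webID=7F92",
    "Initiating requestID=082A for webID=7F92",
    "orderID=34C8 received for requestID=082A",
    "Retrieving userID=11D2 for requestID=082A",
    "accountID=1234 access, userID=11D2",
    "ERROR accountID=1234 not found!",
    "User action webID=AAAA",  # hypothetical unrelated event
]
groups = connected_log_groups(logs)
```

Each returned group is a list of line indices; here the first six lines form one component and the unrelated event stays on its own.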
Distributed systems tracing infrastructure
Dapper (Google)
Zipkin (Twitter)
X-Trace (UC Berkeley)
inCapacity (LinkedIn)
Erlang / Akka
Commercial products
Use case: online shopping
User interactions
– state transition graph
– internal call cascades
Goals: identify unusual...
– ... user behavior
– ... service behavior

[Figure: state transition graph over Login, Browse, Add to cart, Check out]
Use case: online shopping
identify visits (eg, connected components)
“featurize”
statistical modeling / machine learning

Visit  Login  Browse  Cart  Checkout
37CF   1      7       1     0
5450   0      3       2     1
A84B   2      1       1347  0
...    ...    ...     ...   ...
FF71   2      13      2     0
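A minimal sketch of the node-wise "featurize" step: each visit becomes a fixed-length vector of per-state counts. The event sequences below are hypothetical, chosen only so that the output reproduces the 37CF and 5450 rows of the table above; the `nodewise_features` helper is likewise an assumption.

```python
from collections import Counter

STATES = ["Login", "Browse", "Cart", "Checkout"]


def nodewise_features(visits):
    """Map each visit ID to a fixed-length vector of per-state event counts."""
    features = {}
    for visit_id, events in visits.items():
        counts = Counter(events)
        features[visit_id] = [counts[s] for s in STATES]
    return features


# Hypothetical event sequences (the talk only shows the resulting counts).
visits = {
    "37CF": ["Login"] + ["Browse"] * 7 + ["Cart"],
    "5450": ["Browse", "Cart", "Browse", "Cart", "Browse", "Checkout"],
}
features = nodewise_features(visits)
```

Once every visit has the same vector length, standard statistical or machine learning tools can consume the rows directly.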
Use case: online shopping
Alternative featurization
– previous: “node-wise”
– alternative: “edge-wise”

Visit  Login > Browse  Browse > Cart  Cart > Browse  Browse > Checkout  Login > Checkout  ...
37CF   1               7              1              0                  0                 ...
5450   1               3              2              1                  0                 ...
A84B   0               0              0              0                  799               ...
...    ...             ...            ...            ...                ...               ...
FF71   1               13             2              0                  ...               ...
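The edge-wise variant counts transitions between consecutive states instead of the states themselves. The sketch below assumes each visit is an ordered list of states; the `edgewise_features` helper and the "demo" visit are hypothetical and not taken from the talk's data.

```python
from collections import Counter


def edgewise_features(visits, transitions):
    """Count how often each listed state transition occurs within each visit."""
    features = {}
    for visit_id, events in visits.items():
        # Pair every event with its successor to get the traversed edges.
        counts = Counter(zip(events, events[1:]))
        features[visit_id] = [counts[t] for t in transitions]
    return features


TRANSITIONS = [
    ("Login", "Browse"),
    ("Browse", "Cart"),
    ("Cart", "Browse"),
    ("Browse", "Checkout"),
    ("Login", "Checkout"),
]

# Hypothetical visit used only to exercise the function.
visits = {"demo": ["Login", "Browse", "Cart", "Browse", "Cart", "Browse", "Checkout"]}
features = edgewise_features(visits, TRANSITIONS)
```

A transition such as Login > Checkout with a large count is invisible in node-wise counts alone, which is the motivation for looking at edges.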
ML / stats detour: fixed-length feature vectors
Fisher Iris dataset (1936)

Sepal length  Sepal width  Petal length  Petal width  Species
5.0           3.5          1.6           0.6          I. setosa
5.9           3.2          4.8           1.8          I. versicolor
6.1           2.6          5.6           1.4          I. virginica
...           ...          ...           ...          ...

Photo: Danielle Langlois
Always Be Featurizing

Target entity  Features                               Applications
Node           properties, connectivity, neighbors    compromised machine
Edge           properties, nodes, node features       high latency, rare connect
Graph          nodes / edges, connectivity, subgraph  failed session, misbehavior
Use case: unusual remote access detection
Remote access (eg, SSH) graphs
Are our observations “typical”?
– machine-edge: connect from host X to host Y?
– graph: maximum depth / path length?
– user-edge: that user A connects to host X?
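Two of the questions above, sketched under simple assumptions: connections are available as (source host, target host) pairs, "typical" means "seen before in a history window", and depth is the longest chain of hops from a starting host. The host names and both helpers are hypothetical, and the depth search is exhaustive, so it only suits small graphs.

```python
from collections import Counter, defaultdict


def rare_edges(history, observed, min_count=1):
    """Return observed connections seen fewer than min_count times in history."""
    seen = Counter(history)
    return [edge for edge in observed if seen[edge] < min_count]


def max_depth(edges, start):
    """Length in hops of the longest simple path reachable from start."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    def dfs(node, visited):
        best = 0
        for nxt in graph[node]:
            if nxt not in visited:  # never revisit a host on the same path
                best = max(best, 1 + dfs(nxt, visited | {nxt}))
        return best

    return dfs(start, {start})


# Hypothetical SSH connections as (source host, target host) pairs.
history = [("laptop", "bastion"), ("bastion", "web-01"), ("laptop", "bastion")]
today = [("laptop", "bastion"), ("bastion", "web-01"), ("web-01", "db-01")]

unseen = rare_edges(history, today)
depth = max_depth(today, "laptop")
```

The user-edge question can reuse `rare_edges` by feeding it (user, host) pairs instead of (host, host) pairs.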
Use case: understanding internal API calls
GOAL: understand usage of (expensive!) internal service
– each observation is an invoking call graph
How are different invocations...
– ...the same?
– ...different?
Frequent substructure mining
given a collection of graphs
return sub-graphs which occur in ≥ T graphs
Use case: understanding internal API calls
Frequent subgraphs presence/absence as feature
– very common: “infrastructural” stuff
– somewhat common: different usage modes

Request     Auth  Cache  Shadow path
Standard    1     0      0
Optimized   1     1      0
“Shadowed”  1     0      1
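A deliberately simplified sketch of the idea: it treats single labeled edges as the only substructures, keeps those that occur in at least T graphs, and then emits a 0/1 presence feature per kept edge. Full frequent subgraph mining (for example gSpan) also handles multi-edge patterns and isomorphism, which this does not attempt. The call graphs, service names, and helper functions are hypothetical.

```python
from collections import Counter


def frequent_edges(graphs, threshold):
    """Return edges that occur in at least `threshold` of the given graphs."""
    support = Counter()
    for edges in graphs.values():
        support.update(set(edges))  # count each edge at most once per graph
    return sorted(e for e, n in support.items() if n >= threshold)


def presence_features(graphs, substructures):
    """One 0/1 feature per substructure: is it present in the graph?"""
    return {
        name: [int(s in set(edges)) for s in substructures]
        for name, edges in graphs.items()
    }


# Hypothetical call graphs, one edge list per invocation.
calls = {
    "standard": [("Request", "Auth"), ("Request", "Backend")],
    "optimized": [("Request", "Auth"), ("Request", "Cache")],
    "shadowed": [("Request", "Auth"), ("Request", "Backend"), ("Request", "Shadow")],
}
common = frequent_edges(calls, threshold=2)
features = presence_features(calls, common)
```

Lowering the threshold to 1 would also surface the Cache and Shadow edges, which are the ones that separate the usage modes.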
Feature-based graph mining strategy

Step                             Example
1. Determine your goal           ID unusual access
2. Build graph representation    Remote access graph
3. Frame question graphically    High out-degree?
4. “Featurize” graph element(s)  Node → Out-degree
5. Apply modeling to features    Fit parametric model

Domain knowledge; Graph mining; Stats / ML / data mining

Node  Out Degree
A     2
B     0
C     76
...   ...
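Steps 4 and 5 can be sketched end to end. The talk only says "fit parametric model"; the Poisson distribution, the 1e-3 cutoff, and the edge list below are my assumptions, with the edges constructed so that the out-degrees match the table above (A: 2, B: 0, C: 76).

```python
import math
from collections import Counter


def out_degrees(edges, nodes):
    """Count distinct outgoing targets for every node (0 if none)."""
    targets = Counter(src for src, _ in set(edges))
    return {n: targets[n] for n in nodes}


def poisson_tail(k, rate):
    """P(X >= k) for X ~ Poisson(rate), via the complement of the CDF."""
    term = math.exp(-rate)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= rate / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)  # clamp floating-point noise


# Hypothetical remote access edges as (source, target) pairs.
edges = [("A", "B"), ("A", "C")] + [("C", f"host{i}") for i in range(76)]
degrees = out_degrees(edges, nodes=["A", "B", "C"])

rate = sum(degrees.values()) / len(degrees)  # Poisson maximum-likelihood estimate
scores = {n: poisson_tail(d, rate) for n, d in degrees.items()}
flagged = [n for n, p in scores.items() if p < 1e-3]
```

With so few nodes the outlier itself inflates the fitted rate, so in practice a robust estimate or a fit on historical data would likely be a better baseline.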
Acknowledgements, etc
Team: Jack Cheng, Martin Castellanos, Leo Gau, Yuchen Zhao, Ariel Smoliar
We’re selling!
We’re recruiting!
Use case: understanding internal API calls
Alternate approach: spectral clustering
Services architecture graph
Use case: customer behavior modeling
IDEA: treat visits as graphs
– features: node, edge, graph!
– labels: did they sign up / convert / etc?