Chen Chen (1), Cindy X. Lin (1), Matt Fredrikson (2), Mihai Christodorescu (3), Xifeng Yan (4), Jiawei Han (1)
(1) University of Illinois at Urbana-Champaign
(2) University of Wisconsin at Madison
(3) IBM T. J. Watson Research Center
(4) University of California at Santa Barbara
Outline
Motivation
  The efficiency bottleneck encountered in big networks
  Patterns must be preserved
Summarize-Mine
Experiments
Summary
Frequent Subgraph Mining
Find all graphs p such that |Dp| >= min_sup, where Dp is the set of graphs in database D that contain p
Reveals the topological structures of graph data
Useful for many downstream applications
Challenges
Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm
  This will have problems on big networks
  Suppose there is only one triangle in the network
  But there are 1,000,000 length-2 paths
  We must enumerate all these 1,000,000, because any one of them has the potential to grow into a full triangle
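To make the counting argument concrete, here is a minimal Python sketch on a hypothetical toy network: a hub connected to many leaves, plus one extra edge that closes a single triangle. The graph shape and the function names are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations


def build_star_with_one_triangle(n_leaves):
    """Hub 0 is linked to leaves 1..n_leaves; the extra edge (1, 2) closes one triangle."""
    adj = {v: set() for v in range(n_leaves + 1)}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf].add(0)
    adj[1].add(2)
    adj[2].add(1)
    return adj


def count_length2_paths(adj):
    # A length-2 path is a centre vertex plus an unordered pair of its neighbours.
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())


def count_triangles(adj):
    # Each triangle is seen once from each of its three vertices, hence the division by 3.
    closed = sum(
        1
        for nbrs in adj.values()
        for u, v in combinations(sorted(nbrs), 2)
        if v in adj[u]
    )
    return closed // 3


adj = build_star_with_one_triangle(100)
print(count_length2_paths(adj), count_triangles(adj))
```

Even with only 100 leaves, thousands of length-2 paths are candidates for the single triangle, and the number of candidates grows quadratically with the hub degree.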
Too Many Embeddings
Subgraph isomorphism is NP-hard
  So, when the problem size increases, …
During the checking, large graphs are grown from small subparts
  For small subparts, there might be too many (overlapped) embeddings in a big network
  Such embedding enumerations will finally kill us
Motivating Application
System call graphs from security research
  Model dependencies among system calls
  Unique subgraph signatures for malicious programs
  Compare malicious/benign programs
These graphs are very big
  Thousands of nodes on average
  We tried state-of-the-art mining technologies, but they failed
Our Approach
Subgraph isomorphism checking cannot be done on large networks
  So we do it on small graphs
Summarize-Mine
  Summarize: Merge nodes by label and collapse corresponding edges
  Mine: Now, state-of-the-art algorithms should work
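The Summarize step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: vertices with the same label are assigned uniformly at random to a fixed number of groups, edge labels are ignored, and edges that end up inside a single group are dropped. The function name `summarize` and the data layout are hypothetical.

```python
import random


def summarize(labels, edges, groups_per_label, rng):
    """Merge same-label vertices into random groups and collapse the edges.

    labels: dict vertex -> label
    edges: iterable of (u, v) pairs (undirected, unlabeled: an assumption)
    groups_per_label: dict label -> number of groups for that label
    rng: a random.Random instance, so runs can be reproduced
    """
    # Each summary node is identified by (label, group index).
    group_of = {
        v: (lab, rng.randrange(groups_per_label[lab])) for v, lab in labels.items()
    }
    summary_labels = {g: g[0] for g in set(group_of.values())}
    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumption: self-loops created by merging are discarded
            summary_edges.add(frozenset((gu, gv)))  # parallel edges collapse into one
    return summary_labels, summary_edges, group_of


labels = {1: "a", 2: "a", 3: "b", 4: "c"}
edges = [(1, 3), (2, 3), (3, 4), (1, 2)]
s_labels, s_edges, _ = summarize(labels, edges, {"a": 1, "b": 1, "c": 1}, random.Random(0))
print(len(s_labels), len(s_edges))
```

With one group per label, the four-vertex graph shrinks to three summary nodes, and the two a-b edges collapse into one.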
Mining after Summarization
[Figure: the original graphs G1, G2, … are summarized into g1, g2, …; mining runs on the summaries and produces the output patterns. Vertices are labeled a, b, c.]
Remedy for Pattern Changes
Frequent subgraphs are presented on a different abstraction level
  False negatives & false positives, compared to true patterns mined from the un-summarized database D
False negatives (recover)
  Randomized technique + multiple rounds
False positives (delete)
  Verify against D
  Substantial work can be transferred to the summaries
Outline
Motivation
Summarize-Mine
  The algorithm flow-chart
  Recovering false negatives
  Verifying false positives
Experiments
Summary
False Negatives
For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization
  Since we are merging groups of vertices by label, the nodes of p should stay in different groups
Otherwise, ...
[Figure: graph Gi, its summary gi, and pattern p, with vertices labeled a, b, c.]
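Missing Prob. of Embeddings
Suppose
  Assign xj nodes for label lj (j=1,…,L) in the summary Si => xj groups of nodes with label lj in the original graph Gi
  Pattern p has mj nodes with label lj
Then, under random grouping, an embedding of p in Gi is preserved in Si with probability
  $\prod_{j=1}^{L} \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$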
No “Collision” for Same Labels
Consider a specific embedding f: p->Gi; f is preserved if the vertices in f(p) stay in different groups
Randomly assign mj nodes with label lj to xj groups; the probability that they will not “collide” is:
  $\frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$
Multiply probabilities for independent events
Example
A pattern with 5 labels, each label => 2 vertices
  m1 = m2 = m3 = m4 = m5 = 2
Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label
  The summary has 100 vertices
  x1 = x2 = x3 = x4 = x5 = 20
The probability that an embedding will persist
  $\frac{19}{20} \cdot \frac{19}{20} \cdot \frac{19}{20} \cdot \frac{19}{20} \cdot \frac{19}{20} = \left(\frac{19}{20}\right)^5 \approx 0.774$
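The arithmetic in this example can be checked with a short Python sketch. It assumes each vertex is assigned to a group uniformly and independently, so that for one label the no-collision probability is the number of ordered ways to pick mj distinct groups out of xj, divided by xj^mj. The function name `persist_probability` is a hypothetical helper.

```python
from math import perm, prod


def persist_probability(m, x):
    """Probability that one embedding survives summarization.

    m[j]: number of pattern vertices carrying label j
    x[j]: number of groups (summary nodes) reserved for label j
    """
    # perm(xj, mj) counts the ordered choices of mj distinct groups out of xj.
    return prod(perm(xj, mj) / xj**mj for mj, xj in zip(m, x))


q = persist_probability([2] * 5, [20] * 5)
print(round(q, 3))  # 0.774
```

A pattern with more same-label vertices than groups (mj > xj) gets probability 0, since `perm` returns 0 in that case.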
Extend to Multiple Graphs
With x1,…,xL set to the same values across all Gi’s in the database, the persistence probability only depends on m1,…,mL, i.e., pattern p’s vertex label distribution
  We denote this probability as q(p)
Each of p’s support graphs in D continues to support p with probability at least q(p)
  Thus, the overall support can be bounded below by a binomial random variable
Support Moves Downward
False Negative Bound
Example, Cont.
As above, q(p)=0.774
min_sup=50

min_sup'    40      39      38      37      36      35
1 round     0.5966  0.4622  0.3346  0.2255  0.1412  0.0820
2 rounds    0.3559  0.2136  0.1119  0.0508  0.0199  0.0067
3 rounds    0.2123  0.0988  0.0374  0.0115  0.0028  0.0006
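One plausible reading of these numbers is sketched below in Python. It rests on two assumptions that are not spelled out on the slides: (i) a pattern with true support min_sup is missed in one round when a Binomial(min_sup, q(p)) variable falls below the lowered threshold min_sup', and (ii) rounds are independent, so the miss probability after t rounds is the one-round value raised to the power t. The function name `miss_probability` is hypothetical.

```python
from math import comb


def miss_probability(min_sup, min_sup_reduced, q, rounds=1):
    """Upper-bound sketch for missing a pattern in every one of `rounds` rounds."""
    # P(X < min_sup_reduced) for X ~ Binomial(min_sup, q)
    one_round = sum(
        comb(min_sup, k) * q**k * (1 - q) ** (min_sup - k)
        for k in range(min_sup_reduced)
    )
    return one_round**rounds


for t in (1, 2, 3):
    row = [miss_probability(50, s, 0.774, rounds=t) for s in range(40, 34, -1)]
    print(t, [round(p, 4) for p in row])
```

Lowering min_sup' or adding rounds both reduce the miss probability, which matches the trend along the rows and columns of the table.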
False Positives
Much easier to handle
  Just check against the original database D
  Discard if this “actual” support is less than min_sup
[Figure: graph Gi, its summary gi, and pattern p, with vertices labeled a, b, c.]
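The false-positive check can be sketched in Python as a support count over the original database. This is a minimal illustration, not the verification procedure from the talk: it uses a plain backtracking search for a label-preserving, edge-preserving injective mapping (non-induced subgraph matching), and the names `has_embedding` and `verify` as well as the data layout are assumptions.

```python
def has_embedding(p_labels, p_edges, g_labels, g_adj):
    """Return True if pattern p can be mapped injectively into graph g.

    p_labels / g_labels: dict vertex -> label
    p_edges: list of (u, v) pattern edges
    g_adj: dict vertex -> set of neighbours (every vertex of g is a key)
    """
    p_nodes = list(p_labels)
    p_adj = {u: set() for u in p_nodes}
    for u, v in p_edges:
        p_adj[u].add(v)
        p_adj[v].add(u)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return True
        u = p_nodes[len(mapping)]
        used = set(mapping.values())
        for cand in g_labels:
            if cand in used or g_labels[cand] != p_labels[u]:
                continue
            # Every pattern neighbour mapped so far must be adjacent to cand in g.
            if all(mapping[w] in g_adj[cand] for w in p_adj[u] if w in mapping):
                mapping[u] = cand
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def verify(pattern, database, min_sup):
    """Count the pattern's actual support in D and compare it with min_sup."""
    p_labels, p_edges = pattern
    support = sum(
        1 for g_labels, g_adj in database
        if has_embedding(p_labels, p_edges, g_labels, g_adj)
    )
    return support >= min_sup, support


pattern = ({"u": "a", "v": "b"}, [("u", "v")])
g1 = ({1: "a", 2: "b"}, {1: {2}, 2: {1}})
g2 = ({1: "a", 2: "b"}, {1: set(), 2: set()})
print(verify(pattern, [g1, g2], min_sup=2))  # (False, 1)
```

A pattern reported from the summaries is kept only when the first component of the result is True.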
The Same Skeleton as gSpan
DFS code tree
Depth-first search
  Minimum DFS code?
  Check support by isomorphism tests
  Record all one-edge extensions along the way
  Pass down the projected database and recurse
Integrate Verification Schemes
Top-Down and Bottom-Up
Possible factors
  Amount of false positives
  Top-down verification can be performed early
Top-down preferred by experiments
Transaction ID list for p1 => Dp1
  Just search within Dp1
Transaction ID list for p2 => Dp2
  Just search within D-Dp2; if frequent, can stop
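The two transaction-ID shortcuts can be sketched in Python. This follows one reading of the slide, stated here as an assumption: p1 is a sub-pattern (parent) of p2, so every graph supporting p2 also supports p1. The containment test `contains(tid, pattern)` is a hypothetical callable standing in for a subgraph isomorphism check, and both function names are illustrative.

```python
def verify_top_down(contains, child, parent_tids):
    """Support set of a child pattern: only graphs supporting the parent can qualify."""
    return {tid for tid in parent_tids if contains(tid, child)}


def verify_bottom_up(contains, parent, child_tids, all_tids, min_sup):
    """Support of a parent pattern, starting from an already verified child.

    Every graph supporting the child also supports the parent, so only the
    remaining graphs are searched, and the search stops once min_sup is reached.
    The returned support set may therefore be incomplete when the flag is True.
    """
    support = set(child_tids)
    if len(support) >= min_sup:
        return support, True
    for tid in sorted(set(all_tids) - support):
        if contains(tid, parent):
            support.add(tid)
            if len(support) >= min_sup:
                return support, True
    return support, False


# Toy stand-in: transactions are item sets and containment is the subset test.
db = {1: {"a", "b"}, 2: {"a"}, 3: {"a", "b", "c"}}
contains = lambda tid, pattern: pattern <= db[tid]
print(verify_top_down(contains, {"a", "b"}, {1, 2, 3}))
print(verify_bottom_up(contains, {"a"}, {1, 3}, {1, 2, 3}, min_sup=3))
```

The toy item sets only demonstrate the control flow; with real graph patterns the `contains` callable would run the isomorphism test.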
Summary-Guided Verification
Substantial verification work can be performed on the summaries, as well
Iterative Summarize-Mine
Use a single pattern tree to hold all results spanning across multiple iterations
  No need to combine pattern sets in a final step
  Avoid verifying patterns that have already been checked by previous iterations
  Verified support graphs are accurate, so they can help pre-pruning in later iterations
Details omitted
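Since the details are omitted, the following Python sketch only illustrates the outer loop under stated assumptions. A plain dictionary stands in for the pattern tree, the pre-pruning step is left out, and `summarize`, `mine`, and `support_in` are hypothetical callables supplied by the caller.

```python
def summarize_mine(database, min_sup, min_sup_reduced, rounds, summarize, mine, support_in):
    """Run several randomized summarize-then-mine rounds and verify each pattern once.

    summarize(graph) -> summary graph (expected to be randomized)
    mine(summaries, threshold) -> iterable of hashable patterns
    support_in(pattern, database) -> actual support in the original database
    """
    checked = {}  # pattern -> verified verdict, shared across all rounds
    for _ in range(rounds):
        summaries = [summarize(g) for g in database]
        for pattern in mine(summaries, min_sup_reduced):
            if pattern in checked:
                continue  # already verified in an earlier round
            checked[pattern] = support_in(pattern, database) >= min_sup
    return {p for p, frequent in checked.items() if frequent}
```

Because all rounds write into the same structure, no final merge of per-round pattern sets is needed, and each candidate is verified against the original database at most once.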
Outline
Motivation
Summarize-Mine
Experiments
Summary
Dataset
Real data
  W32.Stration, a family of mass-mailing worms
  W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
  Vertex # up to 20,000 and edge # even higher
  Avg. # of vertices: 1,300
Synthetic data
  Size, # of distinct node/edge labels, etc.
  Generator details omitted
A Sample Malware Signature
Mined from W32.Stration
A malware reading and leaking certain registry settings related to the network devices
Comparison with gSpan
gSpan is an efficient graph pattern mining algorithm
Graphs of different sizes are randomly drawn
Eventually, gSpan cannot work
The Influence of min_sup'
Total vs. False Positives
  The gap corresponds to true patterns
  It gradually widens as we decrease min_sup'
Summarization Ratio
10/1 node(s) before/after summarization => ratio=10
Trading-off min_sup' and t as the inner loop
A range of reasonable parameters in the middle
Scalability
On the synthetic data
Parameters are tuned as done above
Outline
Motivation
Summarize-Mine
Experiments
Summary
Summary
We solve the frequent subgraph mining problem for large graphs
  We found interesting malware signatures
  Our algorithm is much more efficient, while the state-of-the-art mining technologies do not work
We show that patterns can be well preserved at a higher level by a good generalization scheme
  Very useful, given the emerging trend of huge networks
  The data has to be preprocessed and summarized
Summary
Our method is orthogonal to many previous works on this topic => Combine for further improvement
  Efficient pattern space traversal
  Other data space reduction techniques different from our compression within individual transactions
    Transaction sampling, merging, etc.
    They perform compression between transactions