
Chen Chen¹, Cindy X. Lin¹, Matt Fredrikson², Mihai Christodorescu³, Xifeng Yan⁴, Jiawei Han¹

¹ University of Illinois at Urbana-Champaign
² University of Wisconsin at Madison
³ IBM T. J. Watson Research Center
⁴ University of California at Santa Barbara

Outline
- Motivation
  - The efficiency bottleneck encountered in big networks
  - Patterns must be preserved
- Summarize-Mine
- Experiments
- Summary

Frequent Subgraph Mining
- Find all graphs p such that |Dp| >= min_sup, where Dp is the set of graphs in the database D that contain p
- Exposes the topological structure of graph data
- Useful for many downstream applications
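The support definition above can be made concrete with a small sketch. It is not the authors' implementation: the graph representation (a label dictionary plus an edge list), the function names `contains` and `support`, and the toy database are all assumptions chosen for illustration, and the containment test is a brute-force search that is only practical for tiny graphs.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force check that `pattern` embeds in `graph`.

    Both arguments are (labels, edges) pairs: labels maps node -> label,
    edges is a list of undirected (u, v) pairs. An embedding is an
    injective, label-preserving node mapping that preserves pattern edges.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[f[v]] for v in p_nodes):
            continue
        if all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges):
            return True
    return False


def support(database, pattern):
    """|Dp|: the number of database graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in database)


# Toy database of three small labeled graphs (illustrative data only).
D = [
    ({1: "a", 2: "b", 3: "c"}, [(1, 2), (2, 3)]),
    ({1: "a", 2: "b", 3: "a"}, [(1, 2), (2, 3), (1, 3)]),
    ({1: "a", 2: "c"}, [(1, 2)]),
]
p = ({0: "a", 1: "b"}, [(0, 1)])  # a single edge a-b

min_sup = 2
print("support:", support(D, p), "frequent:", support(D, p) >= min_sup)
```

The permutation loop is exactly the kind of embedding enumeration that stops scaling once the graphs get large.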

Challenges
- Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm
- This becomes a problem on big networks
  - Suppose there is only one triangle in the network
  - But there are 1,000,000 length-2 paths
  - We must enumerate all 1,000,000 of them, because any one of them has the potential to grow into a full triangle
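To illustrate the triangle argument, here is a minimal sketch on a hypothetical graph that is not taken from the slides: a hub connected to 1,000 leaves, plus one extra edge between two leaves. The helper names are made up for this example. The graph holds a single triangle but roughly half a million length-2 paths, each of which a pattern-growth miner would have to consider.

```python
from math import comb


def build_star_with_chord(n_leaves):
    """Hub node 0 joined to leaves 1..n_leaves, plus one edge between leaves 1 and 2."""
    adj = {0: set()}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf] = {0}
    adj[1].add(2)
    adj[2].add(1)
    return adj


def count_length2_paths(adj):
    """Each pair of neighbors of a node forms one length-2 path through it."""
    return sum(comb(len(neighbors), 2) for neighbors in adj.values())


def count_triangles(adj):
    """Count each triangle once by requiring u < v < w."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


g = build_star_with_chord(1000)
print("length-2 paths:", count_length2_paths(g))
print("triangles:", count_triangles(g))
```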

Too Many Embeddings
- Subgraph isomorphism is NP-hard
  - So, when the problem size increases, …
- During the checking, large graphs are grown from small subparts
- For small subparts, there might be too many (overlapping) embeddings in a big network
- Enumerating all these embeddings eventually becomes infeasible

Motivating Application
- System call graphs from security research
  - Model dependencies among system calls
  - Unique subgraph signatures for malicious programs
  - Compare malicious/benign programs
- These graphs are very big
  - Thousands of nodes on average
  - We tried state-of-the-art mining technologies, but they failed

Our Approach
- Subgraph isomorphism checking cannot be done on large networks
  - So we do it on small graphs
- Summarize-Mine
  - Summarize: merge nodes by label and collapse the corresponding edges
  - Mine: now, state-of-the-art algorithms should work
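The Summarize step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: nodes with the same label are assigned uniformly at random to a fixed number of groups, each non-empty group becomes one summary node, and edges whose endpoints fall into the same group are dropped. The function name `summarize`, the parameter `groups_per_label`, and the toy graph are hypothetical.

```python
import random


def summarize(labels, edges, groups_per_label, rng):
    """Merge same-label nodes into random groups and collapse edges.

    labels: dict node -> label
    edges: iterable of undirected (u, v) pairs
    groups_per_label: dict label -> number of groups for that label
    Returns (summary_labels, summary_edges, group_of).
    """
    # Each node joins one of the groups reserved for its label.
    group_of = {
        node: (label, rng.randrange(groups_per_label[label]))
        for node, label in labels.items()
    }
    # One summary node per non-empty group, carrying the shared label.
    summary_labels = {group: group[0] for group in group_of.values()}
    # Parallel edges collapse automatically because edges live in a set.
    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumption: edges inside one group are dropped
            summary_edges.add(frozenset((gu, gv)))
    return summary_labels, summary_edges, group_of


labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
edges = [(1, 4), (2, 4), (3, 5), (4, 6), (5, 6), (1, 2)]
groups_per_label = {"a": 2, "b": 1, "c": 1}

s_labels, s_edges, group_of = summarize(
    labels, edges, groups_per_label, random.Random(0)
)
print(len(labels), "nodes ->", len(s_labels), "summary nodes")
print(len(edges), "edges ->", len(s_edges), "summary edges")
```

Passing the random generator in explicitly makes it easy to run several independent summarization rounds with different seeds.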

Mining after Summarization

[Figure: the original graphs G1, G2, … are summarized into g1, g2, …; mining runs on the summaries and produces the output. Panels: Original, Summary, Mining, Output]

Remedy for Pattern Changes
- Frequent subgraphs are presented on a different abstraction level
  - False negatives & false positives, compared to true patterns mined from the un-summarized database D
- False negatives (recover)
  - Randomized technique + multiple rounds
- False positives (delete)
  - Verify against D
  - Substantial work can be transferred to the summaries

Outline
- Motivation
- Summarize-Mine
  - The algorithm flow-chart
  - Recovering false negatives
  - Verifying false positives
- Experiments
- Summary

False Negatives
- For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization
  - Since we are merging groups of vertices by label, the nodes of p should stay in different groups
- Otherwise, ...

[Figure: graph Gi, its summary gi, and pattern p]

Missing Prob. of Embeddings
- Suppose
  - The summary Si is assigned xj nodes for label lj (j = 1, …, L) => the nodes with label lj in the original graph Gi are split into xj groups
  - Pattern p has mj nodes with label lj
- Then

No “Collision” for Same Labels
- Consider a specific embedding f: p -> Gi; f is preserved if the vertices in f(p) stay in different groups
- Randomly assign mj nodes with label lj to xj groups; the probability that they will not “collide” is:
- Multiply probabilities for independent events
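The formulas on these slides did not survive extraction. The following is a reconstruction, not the authors' exact notation. It assumes that each node is assigned to one of the xj groups independently and uniformly at random, and it agrees with the worked example that follows (xj = 20 and mj = 2 give 19/20 per label).

```latex
% Probability that m_j nodes with label l_j land in m_j distinct groups out of x_j
\Pr[\text{no collision for label } l_j]
  = \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}

% Labels are handled independently, so the per-label probabilities multiply
\Pr[f \text{ is preserved}]
  = \prod_{j=1}^{L} \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}
```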

Example
- A pattern with 5 labels, each label => 2 vertices
  - m1 = m2 = m3 = m4 = m5 = 2
- Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label
  - The summary has 100 vertices
  - x1 = x2 = x3 = x4 = x5 = 20
- The probability that an embedding will persist:

$$\frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \approx 0.774$$
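The number in this example can be checked with a few lines of Python. The function name `persistence_probability` is made up for this sketch, and the computation assumes uniform, independent group assignment per label.

```python
from math import perm, prod


def persistence_probability(x, m):
    """Probability that no two same-label pattern nodes share a group.

    x[j]: number of groups for label j; m[j]: pattern nodes with label j.
    perm(xj, mj) counts the ways to place mj nodes into distinct groups.
    """
    return prod(perm(xj, mj) / xj**mj for xj, mj in zip(x, m))


q = persistence_probability([20] * 5, [2] * 5)
print(round(q, 3))  # 0.774
```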

Extend to Multiple Graphs
- With x1,…,xL set to the same values across all Gi’s in the database, the probability only depends on m1,…,mL, i.e., pattern p’s vertex label distribution
  - We denote this probability as q(p)
- Each of p’s support graphs in D has a probability of at least q(p) to continue supporting p
  - Thus, the overall support can be bounded below by a binomial random variable

Support Moves Downward


False Negative Bound


Example, Cont.
- As above, q(p) = 0.774
- min_sup = 50

min_sup'    40      39      38      37      36      35
1 round     0.5966  0.4622  0.3346  0.2255  0.1412  0.0820
2 rounds    0.3559  0.2136  0.1119  0.0508  0.0199  0.0067
3 rounds    0.2123  0.0988  0.0374  0.0115  0.0028  0.0006
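The bound formula itself is missing from the extracted slides, so the sketch below rests on an assumed reading: a pattern with min_sup support graphs is missed in one round when a Binomial(min_sup, q(p)) variable falls below the lowered threshold min_sup', and independent rounds multiply. This reading is consistent with the table, where the 2-round and 3-round rows are the square and cube of the 1-round row. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def miss_probability(min_sup, min_sup_prime, q, rounds=1):
    """Assumed bound on missing a pattern in every one of `rounds` rounds."""
    one_round = binom_cdf(min_sup_prime - 1, min_sup, q)
    return one_round**rounds


# Recompute the table layout for q(p) = 0.774 and min_sup = 50.
for rounds in (1, 2, 3):
    row = [miss_probability(50, s, 0.774, rounds) for s in range(40, 34, -1)]
    print(rounds, " ".join(f"{value:.4f}" for value in row))
```

If the assumed reading is right, the printed rows should land close to the table values; small differences could come from rounding q(p).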

False Positives
- Much easier to handle
  - Just check against the original database D
  - Discard if this “actual” support is less than min_sup

[Figure: graph Gi, its summary gi, and pattern p]

The Same Skeleton as gSpan
- DFS code tree
  - Depth-first search
  - Minimum DFS code?
  - Check support by isomorphism tests
  - Record all one-edge extensions along the way
  - Pass down the projected database and recurse

Integrate Verification Schemes
- Top-Down and Bottom-Up
- Possible factors
  - Amount of false positives
  - Top-down verification can be performed early
- Top-down preferred by experiments
- Transaction ID list for p1 => Dp1
  - Just search within Dp1
- Transaction ID list for p2 => Dp2
  - Just search within D - Dp2; if frequent, can stop
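The two transaction-ID shortcuts can be sketched as below. The sketch reads p1 as an already-verified sub-pattern of the pattern being checked and p2 as an already-verified super-pattern; that reading is an assumption, since the slide text does not define them. The function `verify`, its parameters, and the toy containment test (edge-set inclusion standing in for a real subgraph isomorphism test) are all hypothetical.

```python
def verify(pattern, database, contains, min_sup, parent_tids=None, child_tids=None):
    """Decide whether `pattern` is frequent in `database`.

    parent_tids: IDs of graphs supporting a verified sub-pattern (Dp1);
                 only these graphs can contain `pattern`.
    child_tids:  IDs of graphs supporting a verified super-pattern (Dp2);
                 these graphs contain `pattern` for sure.
    Returns (is_frequent, supporting_tids). On an early stop the ID set
    is only a lower bound on the true support set.
    """
    candidates = set(range(len(database))) if parent_tids is None else set(parent_tids)
    supporting = set(child_tids) if child_tids is not None else set()
    if len(supporting) >= min_sup:
        return True, supporting  # already frequent, no tests needed
    for tid in sorted(candidates - supporting):
        if contains(database[tid], pattern):
            supporting.add(tid)
    return len(supporting) >= min_sup, supporting


calls = []  # one entry per containment test, to show the savings


def toy_contains(graph, pattern):
    """Stand-in for subgraph isomorphism: labeled-edge set inclusion."""
    calls.append(1)
    return pattern <= graph


ab, bc, ac = ("a", "b"), ("b", "c"), ("a", "c")
D = [
    frozenset({ab, bc}),
    frozenset({ab, bc, ac}),
    frozenset({ab}),
    frozenset({bc}),
    frozenset({ab, bc}),
]
p = frozenset({ab, bc})

result = verify(p, D, toy_contains, min_sup=3, parent_tids={0, 1, 2, 4}, child_tids={1})
print(result, "containment tests:", len(calls))
```

In this toy run only three containment tests are needed instead of five, because graph 3 is excluded by the sub-pattern's ID list and graph 1 is already known to support the pattern.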

Summary-Guided Verification
- Substantial verification work can be performed on the summaries as well


Iterative Summarize-Mine
- Use a single pattern tree to hold all results spanning across multiple iterations
  - No need to combine pattern sets in a final step
  - Avoid verifying patterns that have already been checked by previous iterations
  - Verified support graphs are accurate, so they can help pre-pruning in later iterations
- Details omitted

Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary

Dataset
- Real data
  - W32.Stration, a family of mass-mailing worms
  - W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
  - Vertex # up to 20,000 and edge # even higher
  - Avg. # of vertices: 1,300
- Synthetic data
  - Size, # of distinct node/edge labels, etc.
  - Generator details omitted

A Sample Malware Signature
- Mined from W32.Stration
- A malware reading and leaking certain registry settings related to the network devices

Comparison with gSpan
- gSpan is an efficient graph pattern mining algorithm
- Graphs of different sizes are randomly drawn
- Eventually, gSpan stops working

The Influence of min_sup'
- Total vs. False Positives
- The gap corresponds to true patterns
- It gradually widens as we decrease min_sup'

Summarization Ratio
- 10/1 node(s) before/after summarization => ratio = 10
- Trading off min_sup' and t as the inner loop
- A range of reasonable parameters in the middle

Scalability
- On the synthetic data
- Parameters are tuned as done above

Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary

Summary
- We solve the frequent subgraph mining problem for large graphs
  - We found interesting malware signatures
  - Our algorithm is much more efficient, while the state-of-the-art mining technologies do not work
- We show that patterns can be well preserved on a higher level by a good generalization scheme
  - Very useful, given the emerging trend of huge networks
  - The data has to be preprocessed and summarized

Summary
- Our method is orthogonal to many previous works on this topic => Combine for further improvement
  - Efficient pattern space traversal
  - Other data space reduction techniques different from our compression within individual transactions
    - Transaction sampling, merging, etc.
    - They perform compression between transactions