38
DIGITAL FORENSIC RESEARCH CONFERENCE On The Database Lookup Problem Of Approximate Matching By Frank Breitinger, Harald Baier and Douglas White Presented At The Digital Forensic Research Conference DFRWS 2014 EU Amsterdam, NL (May 7 th - 9 th ) DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working groups, annual conferences and challenges to help drive the direction of research and development. http:/dfrws.org

On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Embed Size (px)

Citation preview

Page 1: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

DIGITAL FORENSIC RESEARCH CONFERENCE

On The Database Lookup Problem Of Approximate Matching

By

Frank Breitinger, Harald Baier and Douglas White

Presented At

The Digital Forensic Research Conference

DFRWS 2014 EU Amsterdam, NL (May 7th - 9th)

DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized

the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners

together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working

groups, annual conferences and challenges to help drive the direction of research and development.

http:/dfrws.org

Page 2: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

On the database lookup problem ofapproximate matching

Frank Breitinger, Harald Baier, Douglas White

Hochschule Darmstadt, CASED

DFRWS-EU, 2014-05-08

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 1/32

Page 3: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Harald Baier

1. Doctoral degree from TU Darmstadt in the area of ellipticcurve cryptography.

2. Principal Investigator within Center for Advanced SecurityResearch Darmstadt (CASED)

3. Establishment of forensic courses within HochschuleDarmstadt.

4. Current working fields:I Fuzzy Hashing (IT forensics, biometrics).

I Anomaly detection in high-tra�c environments.

I Security protocols for eMRTD.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 2/32

Page 4: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 3/32

Page 5: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 3/32

Page 6: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 3/32

Page 7: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 3/32

Page 8: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 3/32

Page 9: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 4/32

Page 10: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Use Case: Prosecution

1. Police and prosecutors confronted with di↵erent storagemedia:

I Hard disk drives, solid-state drives, USB sticks.

I Mobile phones, SIM cards.

I Digital cameras, digital camcorders, SD cards.

I CDs, DVDs.

I RAM (dumps).

I ...

2. Amount of distrained data often exceeds 1 terabyte.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 5/32

Page 11: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Di↵erent views of 1 terabyte

1 terabyte of digital text is (ap-proximately) equal to:

1. 1 trillion characters: 1 character = 1 byte.

2. 220 million pages: 1 page = 5000 characters.

3. 21 years of printing time: 20 sheets per minute.

4. 1 million kg of paper: onesided printed.

5. Paper stack of 22 km height: bulk of 0.1 mm.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 6/32

Page 12: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Finding relevant files resembles ...

Source: tu-harburg.de Source: beepworld.de

Key question:How to minimize the haystack or enlarge the needle?

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 7/32

Page 13: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Motivation

Finding relevant files resembles ...

Source: tu-harburg.de Source: beepworld.de

Key question:How to minimize the haystack or enlarge the needle?

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 7/32

Page 14: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Foundations

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 8/32

Page 15: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Foundations

Hash functions in digital forensics

1. Automatically identify known files:I Filter in: Highlight suspect files (e.g., company secrets)

I Filter out: Remove non-relevant objects (e.g., OS files)

2. Proceeding:2.1 Hash the file,

2.2 Compare the resulting hash against a database,

2.3 and put it on one of the categories:I

Known-to-be-good (non-relevant).

IKnown-to-be-bad (suspect).

IUnknown files.

3. Goal:I Known files can be identified very e�ciently.

I Reduces amount of data investigator has to look at by hand.Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 9/32

Page 16: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Foundations

Cryptographic hash functions in digital forensics

1. Identifying exact duplicates is solved using cryptographic hashfunctions:

I Filter out: National Software Reference Library (NSRL).

I Filter in: Perkeo database in Germany.

2. A sample drawback: avalanche e↵ect.

1 $ echo ’Dear Angela, I give you 1 million EUR. Wolfgang’ | sha1sum2 9bf13969f2c283cfe0ace585667fa22c7ab4f84a -3

4 $ echo ’Dear Angela, I give you 1 billion EUR. Wolfgang’ | sha1sum5 60d0b09f8d18e75b3cd7ffb0406de84bbc459510 -

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 10/32

Page 17: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Foundations

Approximate matching

1. However, investigators need robust algorithms that allowsimilarity detection.

2. Sample use cases:I Di↵erent versions of files.

I Embedded objects.

I Fragments of files.

I Network packets.

=) Approximate matching (similarity hashing, fuzzy hashing).

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 11/32

Page 18: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Foundations

Our notation

x is the number of files in the database.

Bloom filter is a bit array to represent data.

m denotes the Bloom filter size in bits.

feature describes a byte sequence which is hashed and inserted intothe Bloom filter.

k number of sub-hashes where each one sets a bit in the Bloomfilter.

n is the number of features inserted into a Bloom filter.

s denotes the file set size in MiB.

SB denotes the set of blacklisted files.

SD denotes the set of files on a seized device.Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 12/32

Page 19: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 13/32

Page 20: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

NSRL-RDS

Cryptographic hash values can be sorted, e.g., RDS:

$ less NSRLFile.txt

"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode""000000206738748EDD92C4E3D2E823896700F849","392126E756571EBF112CB1C1CDEDF926","EBD105A0","I05002T2.PFB",98865,3095,"WIN","""0000004DA6391F7F5D2F7FCCF36CEBDA60C6EA02","0E53C14A3E48D94FF596A2824307B492","AA6A7B16","00br2026.gif",2226,228,"WIN","""000000A9E47BD385A0A3685AA12C2DB6FD727A20","176308F27DD52890F013A3FD80F92E51","D749B562","femvo523.wav",42748,4887,"MacOSX","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,18266,"358","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2322,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2575,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2583,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,3271,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,3282,"UNK",""

=) E�cient decision if a given hash value matches a hash of theRDS (in O(log(x)) or O(1) comparisons)

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 14/32

Page 21: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Indexing problem of similarity digests

1. Similarity digests cannot be indexed in general:I To decide if a given fuzzy hash is similar to one of the

database requires O(x) comparisons, i.e., against all.

I Comparison complexity is O(xy) if a set comprising y elementsis compared to the database.

I Too slow for practical usage.

2. Winter et al. presented a solution for ssdeep digests (aBase64 sequence) called F2S2.

3. No solution for Bloom filter digests:I sdhash.

I mrsh-v2.

I mvhash-B.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 15/32

Page 22: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Solution overview

1. Overall idea: store all files in one single (huge) Bloom filter.

2. Bloom filter should fit to RAM for e�ciency reasons.

3. Our setting aims at a ratio 1/100, i.e., a 200 GiB set SB

requires ⇡ 2 GiB Bloom filter.

4. Benefit:I Comparison complexity is O(1).

5. Drawback:I File to set comparison yields a binary decision.

I Result: yes, file is in the set vs. no, it is not.

I Su�cient for Blacklisting?!

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 16/32

Page 23: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Solution alternatives

1. Bloom filter of SB fits to RAM: Best case.I Bloom filter filled with the black listed files in advance.

I Files of SD compared against Bloom filter.

2. Bloom filter of SB does not fit to RAM: Worst case.I Fill Bloom filter with files of SD (if possible).

I Black listed files from SB are compared against Bloom filter ofSD (use precomputed hashes of SB if possible).

I Bloom filter of SD cannot be computed in advance.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 17/32

Page 24: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Some details

1. Match decision:I A fragment of a given file is assumed to be in the Bloom filter,

if a su�ciently large number of subsequent features is found inthe filter (longest run, lr).

I Let rmin denote the minimum number of subsequent featuresfor a match: lr � rmin.

I Our prototype sets rmin = 6.

2. We aim at a fragment false positive rate of pf = 10

�6.

3. Bloom filter size:

m = � k · s · 214

ln(1� krminppf )

I Approximately 1/100 of the size of the input file set.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 18/32

Page 25: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Problem description and solution overview

Our tool mrsh-net

1. Based on multi resolution hashing algorithm mrsh-v2.

2. Originally developed for network packet approximatematching.

3. Available viahttp://www.dasec.h-da.de/staff/breitinger-frank/

4. Result presentation:I Due to file to set comparison: no similarity score is computed.

I Instead the following (sample) output is given:

file1.ppt: 163 of 2518 (longest run: 111)

5. Parameters can be adjusted in the config file config.h.

6. The paper discuss all parameters and sample choices.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 19/32

Page 26: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 20/32

Page 27: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Test corpus

1. Real world files from the t5-corpus.

2. Available via http://roussev.net/t5/

3. Contains 4,457 files with a total size of 1.78 GiB.

4. The average file size is ⇡ 400 KiB.

5. File type distribution:

jpg gif doc xls ppt html pdf txt362 67 533 250 368 1093 1073 711

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 21/32

Page 28: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

E�ciency: Database size

1. In case of sdhash, mrsh-v2, F2S2 and SHA-1 the databasecomprises the (similarity) hashes.

2. Worst case describes the scenario where the Bloom filter ofSB does not fit to RAM and hence is not used.

3. mrsh-net makes use of the default Bloom filter size of32 MiB (su�cient for set size of SB of ⇡ 3 GiB).

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 22/32

Page 29: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

E�ciency: Run time

1. ’Hashing’ denotes the time to hash SD, i.e., to hash all files ofthe t5-corpus.

2. mrsh-net ’worst’-columns: 2nd column is optimised for thisdataset (more e�cient feature hash function for ’small’datasets).

3. ’Comparing’ is the time to compare all files of the t5-corpusagainst the hash database of SB (if available).

4. ’Total’ is the overall time (total = comparing + hashing).Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 23/32

Page 30: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

E�ciency: Real world scenario

1. Assumptions:I Size of SB : 1, 500 GiB.

I Size of SD: 200 GiB.

2. We assume a linear growth in both space and run time.

3. The e�ciency advantage of mrsh-net is obvious.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 24/32

Page 31: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Detection performance

1. Our prototype decides between match and non-match basedon the longest run and thus on longest common substring(LCS).

2. Key issue: there is no labelled reference data set available.

3. We therefore use an approximation of the LCS (aLCS) asexplained yesterday in the talk by Vassil Roussev.

I Ground truth decision based on aLCS score:

file1 | file2 | aLCS | entropy

a.dat | b.dat | 993 | 5.56

c.dat | d.dat | 11945 | 0.5

I Note: aLCS is a lower bound on actual LCS.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 25/32

Page 32: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Definition of classification result

1. Definition of true positive (TP), false positive (FP), truenegative (TN) and false negative (FN) as follows:TP: mrsn(f,BF ) � rmin and aLCS(f,GT ) � rmin · bs.

FP: mrsn(f,BF ) � rmin and aLCS(f,GT ) < rmin · bs.

TN: mrsn(f,BF ) < rmin and aLCS(f,GT ) < rmin · bs.

FN: mrsn(f,BF ) < rmin and aLCS(f,GT ) � rmin · bs.

2. Remark: rmin · bs = 6 · 64 = 384 bytes.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 26/32

Page 33: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Detection performance: Results

1. Confusion matrix:

Classified as Positive NegativeActual situation

Positive 2537 436

Negative 18 1466

2. Results:

Precision: TPTP+FP =

25372555 = 99.3%

Recall: TPTP+FN =

25372973 = 85.3%

Accuracy: TP+TNTP+FP+TN+FN =

40034457 = 89.8%

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 27/32

Page 34: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Experimental results and assessment

Decrease false negatives

1. Having a closer look at the very high number of falsenegatives, we observe that most aLCS matches are based onlow entropy sequences.

2. Future work will take entropy of LCS into account.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 28/32

Page 35: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Conclusion and future work

Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 29/32

Page 36: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Conclusion and future work

Take home messages

1. It is important to have indexing strategies for similaritydigests.

2. Otherwise they will not operate with practical speed.

3. We have presented and evaluated a new approach toe�ciently decide about the similar membership of a file to agiven dataset.

4. The lookup complexity decreased from O(x) comparisons toO(1) for one file.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 30/32

Page 37: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Conclusion and future work

Future work

1. Decrease the number of false negatives.

2. Perform a detection performance study in terms of ROC orDET curves.

3. Extend the algorithm to find the actual similar file.

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 31/32

Page 38: On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

Conclusion and future work

Questions?

Source: www.dilbert.com/strips/2011-02-03

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 32/32