On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem

DIGITAL FORENSIC RESEARCH CONFERENCE

On The Database Lookup Problem Of Approximate Matching

By

Frank Breitinger, Harald Baier and Douglas White

Presented At

The Digital Forensic Research Conference

DFRWS 2014 EU Amsterdam, NL (May 7th - 9th)

DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized

the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners

together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working

groups, annual conferences and challenges to help drive the direction of research and development.

http:/dfrws.org

On the database lookup problem ofapproximate matching

Frank Breitinger, Harald Baier, Douglas White

Hochschule Darmstadt, CASED

DFRWS-EU, 2014-05-08

Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 1/32

Harald Baier

1. Doctoral degree from TU Darmstadt in the area of ellipticcurve cryptography.

2. Principal Investigator within Center for Advanced SecurityResearch Darmstadt (CASED)

3. Establishment of forensic courses within HochschuleDarmstadt.

4. Current working fields:I Fuzzy Hashing (IT forensics, biometrics).

I Anomaly detection in high-tra�c environments.

I Security protocols for eMRTD.


Motivation

Foundations

Problem description and solution overview

Experimental results and assessment

Conclusion and future work


Motivation

Foundations





Motivation

Foundations





Motivation

Foundations





Motivation

Foundations





Motivation

Motivation

Foundations





Motivation

Use Case: Prosecution

1. Police and prosecutors confronted with di↵erent storagemedia:

I Hard disk drives, solid-state drives, USB sticks.

I Mobile phones, SIM cards.

I Digital cameras, digital camcorders, SD cards.

I CDs, DVDs.

I RAM (dumps).

I ...

2. Amount of distrained data often exceeds 1 terabyte.


Motivation

Di↵erent views of 1 terabyte

1 terabyte of digital text is (ap-proximately) equal to:

1. 1 trillion characters: 1 character = 1 byte.

2. 220 million pages: 1 page = 5000 characters.

3. 21 years of printing time: 20 sheets per minute.

4. 1 million kg of paper: onesided printed.

5. Paper stack of 22 km height: bulk of 0.1 mm.


Motivation

Finding relevant files resembles ...

Source: tu-harburg.de Source: beepworld.de

Key question:How to minimize the haystack or enlarge the needle?


Motivation

Finding relevant files resembles ...

Source: tu-harburg.de Source: beepworld.de

Key question:How to minimize the haystack or enlarge the needle?


Foundations

Motivation

Foundations





Foundations

Hash functions in digital forensics

1. Automatically identify known files:I Filter in: Highlight suspect files (e.g., company secrets)

I Filter out: Remove non-relevant objects (e.g., OS files)

2. Proceeding:2.1 Hash the file,

2.2 Compare the resulting hash against a database,

2.3 and put it on one of the categories:I

Known-to-be-good (non-relevant).

IKnown-to-be-bad (suspect).

IUnknown files.

3. Goal:I Known files can be identified very e�ciently.

I Reduces amount of data investigator has to look at by hand.Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 9/32

Foundations

Cryptographic hash functions in digital forensics

1. Identifying exact duplicates is solved using cryptographic hashfunctions:

I Filter out: National Software Reference Library (NSRL).

I Filter in: Perkeo database in Germany.

2. A sample drawback: avalanche e↵ect.

1 $ echo ’Dear Angela, I give you 1 million EUR. Wolfgang’ | sha1sum2 9bf13969f2c283cfe0ace585667fa22c7ab4f84a -3

4 $ echo ’Dear Angela, I give you 1 billion EUR. Wolfgang’ | sha1sum5 60d0b09f8d18e75b3cd7ffb0406de84bbc459510 -


Foundations

Approximate matching

1. However, investigators need robust algorithms that allowsimilarity detection.

2. Sample use cases:I Di↵erent versions of files.

I Embedded objects.

I Fragments of files.

I Network packets.

=) Approximate matching (similarity hashing, fuzzy hashing).


Foundations

Our notation

x is the number of files in the database.

Bloom filter is a bit array to represent data.

m denotes the Bloom filter size in bits.

feature describes a byte sequence which is hashed and inserted intothe Bloom filter.

k number of sub-hashes where each one sets a bit in the Bloomfilter.

n is the number of features inserted into a Bloom filter.

s denotes the file set size in MiB.

SB denotes the set of blacklisted files.

SD denotes the set of files on a seized device.Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 12/32


Motivation

Foundations






NSRL-RDS

Cryptographic hash values can be sorted, e.g., RDS:

$ less NSRLFile.txt

"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode""000000206738748EDD92C4E3D2E823896700F849","392126E756571EBF112CB1C1CDEDF926","EBD105A0","I05002T2.PFB",98865,3095,"WIN","""0000004DA6391F7F5D2F7FCCF36CEBDA60C6EA02","0E53C14A3E48D94FF596A2824307B492","AA6A7B16","00br2026.gif",2226,228,"WIN","""000000A9E47BD385A0A3685AA12C2DB6FD727A20","176308F27DD52890F013A3FD80F92E51","D749B562","femvo523.wav",42748,4887,"MacOSX","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,18266,"358","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2322,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2575,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,2583,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,3271,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,3282,"UNK",""

=) E�cient decision if a given hash value matches a hash of theRDS (in O(log(x)) or O(1) comparisons)



Indexing problem of similarity digests

1. Similarity digests cannot be indexed in general:I To decide if a given fuzzy hash is similar to one of the

database requires O(x) comparisons, i.e., against all.

I Comparison complexity is O(xy) if a set comprising y elementsis compared to the database.

I Too slow for practical usage.

2. Winter et al. presented a solution for ssdeep digests (aBase64 sequence) called F2S2.

3. No solution for Bloom filter digests:I sdhash.

I mrsh-v2.

I mvhash-B.



Solution overview

1. Overall idea: store all files in one single (huge) Bloom filter.

2. Bloom filter should fit to RAM for e�ciency reasons.

3. Our setting aims at a ratio 1/100, i.e., a 200 GiB set SB

requires ⇡ 2 GiB Bloom filter.

4. Benefit:I Comparison complexity is O(1).

5. Drawback:I File to set comparison yields a binary decision.

I Result: yes, file is in the set vs. no, it is not.

I Su�cient for Blacklisting?!



Solution alternatives

1. Bloom filter of SB fits to RAM: Best case.I Bloom filter filled with the black listed files in advance.

I Files of SD compared against Bloom filter.

2. Bloom filter of SB does not fit to RAM: Worst case.I Fill Bloom filter with files of SD (if possible).

I Black listed files from SB are compared against Bloom filter ofSD (use precomputed hashes of SB if possible).

I Bloom filter of SD cannot be computed in advance.



Some details

1. Match decision:I A fragment of a given file is assumed to be in the Bloom filter,

if a su�ciently large number of subsequent features is found inthe filter (longest run, lr).

I Let rmin denote the minimum number of subsequent featuresfor a match: lr � rmin.

I Our prototype sets rmin = 6.

2. We aim at a fragment false positive rate of pf = 10

�6.

3. Bloom filter size:

m = � k · s · 214

ln(1� krminppf )

I Approximately 1/100 of the size of the input file set.



Our tool mrsh-net

1. Based on multi resolution hashing algorithm mrsh-v2.

2. Originally developed for network packet approximatematching.

3. Available viahttp://www.dasec.h-da.de/staff/breitinger-frank/

4. Result presentation:I Due to file to set comparison: no similarity score is computed.

I Instead the following (sample) output is given:

file1.ppt: 163 of 2518 (longest run: 111)

5. Parameters can be adjusted in the config file config.h.

6. The paper discuss all parameters and sample choices.


http://www.dasec.h-da.de/staff/breitinger-frank/


Motivation

Foundations






Test corpus

1. Real world files from the t5-corpus.

2. Available via http://roussev.net/t5/

3. Contains 4,457 files with a total size of 1.78 GiB.

4. The average file size is ⇡ 400 KiB.

5. File type distribution:

jpg gif doc xls ppt html pdf txt362 67 533 250 368 1093 1073 711


http://roussev.net/t5/


E�ciency: Database size

1. In case of sdhash, mrsh-v2, F2S2 and SHA-1 the databasecomprises the (similarity) hashes.

2. Worst case describes the scenario where the Bloom filter ofSB does not fit to RAM and hence is not used.

3. mrsh-net makes use of the default Bloom filter size of32 MiB (su�cient for set size of SB of ⇡ 3 GiB).



E�ciency: Run time

1. ’Hashing’ denotes the time to hash SD, i.e., to hash all files ofthe t5-corpus.

2. mrsh-net ’worst’-columns: 2nd column is optimised for thisdataset (more e�cient feature hash function for ’small’datasets).

3. ’Comparing’ is the time to compare all files of the t5-corpusagainst the hash database of SB (if available).

4. ’Total’ is the overall time (total = comparing + hashing).Harald Baier Database lookup problem / DFRWS-EU, 2014-05-08 23/32


E�ciency: Real world scenario

1. Assumptions:I Size of SB : 1, 500 GiB.

I Size of SD: 200 GiB.

2. We assume a linear growth in both space and run time.

3. The e�ciency advantage of mrsh-net is obvious.



Detection performance

1. Our prototype decides between match and non-match basedon the longest run and thus on longest common substring(LCS).

2. Key issue: there is no labelled reference data set available.

3. We therefore use an approximation of the LCS (aLCS) asexplained yesterday in the talk by Vassil Roussev.

I Ground truth decision based on aLCS score:

file1 | file2 | aLCS | entropy

a.dat | b.dat | 993 | 5.56

c.dat | d.dat | 11945 | 0.5

I Note: aLCS is a lower bound on actual LCS.



Definition of classification result

1. Definition of true positive (TP), false positive (FP), truenegative (TN) and false negative (FN) as follows:TP: mrsn(f,BF ) � rmin and aLCS(f,GT ) � rmin · bs.

FP: mrsn(f,BF ) � rmin and aLCS(f,GT ) < rmin · bs.

TN: mrsn(f,BF ) < rmin and aLCS(f,GT ) < rmin · bs.

FN: mrsn(f,BF ) < rmin and aLCS(f,GT ) � rmin · bs.

2. Remark: rmin · bs = 6 · 64 = 384 bytes.



Detection performance: Results

1. Confusion matrix:

Classified as Positive NegativeActual situation

Positive 2537 436

Negative 18 1466

2. Results:

Precision: TPTP+FP =

25372555 = 99.3%

Recall: TPTP+FN =

25372973 = 85.3%

Accuracy: TP+TNTP+FP+TN+FN =

40034457 = 89.8%



Decrease false negatives

1. Having a closer look at the very high number of falsenegatives, we observe that most aLCS matches are based onlow entropy sequences.

2. Future work will take entropy of LCS into account.



Motivation

Foundations






Take home messages

1. It is important to have indexing strategies for similaritydigests.

2. Otherwise they will not operate with practical speed.

3. We have presented and evaluated a new approach toe�ciently decide about the similar membership of a file to agiven dataset.

4. The lookup complexity decreased from O(x) comparisons toO(1) for one file.



Future work

1. Decrease the number of false negatives.

2. Perform a detection performance study in terms of ROC orDET curves.

3. Extend the algorithm to find the actual similar file.



Questions?

Source: www.dilbert.com/strips/2011-02-03


Documents

On the database lookup problem of approximate … The Database Lookup Problem Of Approximate Matching By ... Indexing problem of similarity digests 1. ... On the database lookup problem