Detecting visual plagiarism with perception hashing

JOHANNES WÅLARÖ

DEGREE PROJECT IN COMPUTER SCIENCE, FIRST LEVEL
Supervisor: Richard James Glassey

KTH ROYAL INSTITUTE OF TECHNOLOGY
STOCKHOLM, SWEDEN 2015

Abstract

Classifying images and deciding algorithmically whether or not they are the same image is a complex problem in computer science. This report looks at using perceptual hashing to fingerprint images so that they can be matched to one another. The results show that threshold matching of images can be done with good accuracy, while a zero-distance match is not reliable in most scenarios. The conclusion is that a hybrid approach, which uses a more inclusive perceptual hashing algorithm to filter out candidates for a more thorough algorithm, might be a good choice.

Referat (Swedish abstract, translated)

Classifying images and determining algorithmically whether two images are alike is a complex problem in computer science. This report examines perceptual hash algorithms as a method for generating a fingerprint of an image so that images can be matched against each other. The results show that threshold matching can be used with good results, while zero-distance matching is not reliable in most cases. The conclusion is that a hybrid solution, which uses a more inclusive perceptual hash to filter out candidates for the more thorough algorithm, may be a good choice.

Contents

1 Introduction
    1.1 Objectives of this report
    1.2 Problem statement

2 Background
    2.1 Hamming distance
    2.2 Normalized Hamming distance
    2.3 aHash
        2.3.1 Image Resizing
        2.3.2 Gray scaling
        2.3.3 Calculate average color
        2.3.4 Generating the hash
    2.4 dHash
        2.4.1 Image Resizing
        2.4.2 Gray scaling
        2.4.3 Generating a difference field
        2.4.4 Generating the hash
    2.5 pHash with DCT
        2.5.1 Gray scaling
        2.5.2 Image smoothing
        2.5.3 Image Resizing
        2.5.4 Discrete cosine transformation
        2.5.5 Calculate median value
        2.5.6 Use median value to generate hash
    2.6 pHash with Radial Variance
        2.6.1 Gray scaling
        2.6.2 Image blurring
        2.6.3 Gamma correction
        2.6.4 Radon Projections
        2.6.5 Feature extraction
        2.6.6 Discrete Cosine transformation
        2.6.7 Peak of Cross Correlation[8]
    2.7 pHash with Marr–Hildreth Operator
        2.7.1 Image Resizing
        2.7.2 Gray scaling
        2.7.3 Image Blurring
        2.7.4 Correlation of the Marr-Hildreth kernel
        2.7.5 Image normalization
        2.7.6 Hash generation
    2.8 Searching hashes
        2.8.1 Logarithmic search
        2.8.2 Linear search
    2.9 ImageHash 0.3 for python
    2.10 pHash for C++

3 Method
    3.1 Image selection
    3.2 Hash selection
    3.3 Rudimentary image manipulation
        3.3.1 No change
        3.3.2 200% Scaling
        3.3.3 Color correction
        3.3.4 Contrast adjustment
        3.3.5 Black and white filter
    3.4 One to One hash calculation
    3.5 One to Many hash calculation

4 Results

5 Discussion & Conclusion

Bibliography

Chapter 1

Introduction

Classifying images can be done in a myriad of different ways, from object recognition and bag-of-words classification to disregarding the image's content entirely and using strict cryptographic hashes.

Perceptual hashing is something of a middle road. In the classical paradigm of cryptographic hashes, the tiniest change avalanches into an entirely different hash. Perceptual hashing instead uses the image's content to fingerprint the image, so that even if two hashes are not identical they can be used to determine how “close” the images are to one another (in terms of the underlying criteria).

Most of the perceptual hashes described in this report measure this closeness with a simple Hamming distance, which is quick to calculate. The algorithms also tolerate a certain amount of change, so changing a single pixel, for example, will not change the generated hash in most cases.

1.1 Objectives of this report

The objective of this report is to look at a few different perceptual hashing algorithms and see whether, and how, they might be used to detect visual plagiarism through hash matching.

1.2 Problem statement

The problem of detecting visual plagiarism is to find a method that can compare an image to a relatively large dataset in a reasonable amount of time while also providing good accuracy in the matching.


Chapter 2

Background

2.1 Hamming distance

The Hamming distance[1] can be used on most of the resulting hashes to get the perceived difference between two images: perceptually similar images should have a short Hamming distance (ideally a distance of 0 for the same image).

The definition of the Hamming distance is quite straightforward: the Hamming distance d(x,y) is the number of positions in which x and y differ.

In other words, since the algorithms return a binary number, the Hamming distance is simply the number of bit positions in which the two hashes differ.
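
As a minimal sketch of this definition, assuming each hash is held as a Python integer (the function name is illustrative and not taken from any of the libraries discussed), the distance can be computed by XOR-ing the two hashes and counting the set bits:

```python
def hamming_distance(x: int, y: int) -> int:
    """Return the number of bit positions in which x and y differ."""
    # XOR leaves a 1 exactly where the two hashes disagree.
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    a = 0b10110100
    b = 0b10010110
    # The two values differ in bit positions 1 and 5.
    print(hamming_distance(a, b))  # 2
```

Two identical hashes give a distance of 0, and two 64-bit hashes that disagree everywhere give 64.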

2.2 Normalized Hamming distance

In 2.1 we saw that a Hamming distance is calculated for a one-dimensional sequence, in this case a single sequence of binary digits. A normalized Hamming distance works in much the same way, except that it is defined for a sequence of binary sequences, in this case an array of uint8 values[2].

The normalized Hamming distance is simply the sum of the partial Hamming distances between each pair of corresponding elements.
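
The following sketch sums the per-byte Hamming distances of two equally long byte arrays. Dividing that sum by the total number of bits, so that the result lies between 0 and 1, is an assumption made here to justify the word "normalized"; the text above only describes the sum. The function name is illustrative.

```python
def normalized_hamming_distance(a: bytes, b: bytes) -> float:
    """Sum the per-byte Hamming distances and scale by the total bit count."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    if not a:
        return 0.0
    # Partial Hamming distance for each pair of corresponding uint8 elements.
    total = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    # Assumption: normalize by the number of bits (8 per byte).
    return total / (8 * len(a))
```

With this scaling, two identical arrays give 0.0 and two arrays that differ in every bit give 1.0, regardless of the array length.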


2.3 aHash

aHash uses the average pixel intensity of the image to calculate a hash[3].

2.3.1 Image Resizing

The image is first reduced to an n by n image, where n is 8 in the implementation used.

2.3.2 Gray scaling

The image is then reduced to gray scale to reduce the color information density of the image. This effectively makes the image an 8x8 field with values in the range of the gray scale's color depth.

2.3.3 Calculate average color

The image's average color is calculated by taking the sum of all the pixel values and dividing by the number of pixels.

2.3.4 Generating the hash

The hash is then generated by setting each value in a corresponding vector (8x8 = 64 elements long) to 1 if the pixel in question is above the average calculated in the previous step, and to 0 otherwise.

This 64-bit binary vector can then be represented in any convenient form, commonly as a hexadecimal string.
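
The last two steps can be sketched as follows. The sketch assumes the image has already been resized to 8x8 and gray scaled, and is passed in as eight rows of eight numbers; the function names and the row-by-row bit order are illustrative choices, not those of the ImageHash library.

```python
def average_hash(gray):
    """Compute a 64-bit aHash from 8 rows of 8 grayscale values."""
    pixels = [p for row in gray for p in row]
    if len(pixels) != 64:
        raise ValueError("expected an 8x8 grayscale image")
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        # Append a 1 bit if the pixel is above the average, else a 0 bit.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def to_hex(hash_value):
    """Render a 64-bit hash as a 16-character hexadecimal string."""
    return format(hash_value, "016x")
```

For an image whose left half is black and whose right half is white, every row contributes the bits 00001111, so the hash reads 0f0f0f0f0f0f0f0f.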


2.4 dHash

This algorithm uses a difference field to generate a hash value[4].

2.4.1 Image Resizing

The image is first reduced to an n+1 by n image (n+1 columns, n rows), where n is 8 in the implementation used.

2.4.2 Gray scaling

The image is reduced to gray scale to reduce the color information density of the image. This effectively makes the image a 9x8 field with values in the range of the gray scale's color depth.

2.4.3 Generating a difference field

The difference field is then calculated from the 9x8 image img as $df_{i,j} = img_{i,j} - img_{i,j+1}$. This is why the image has nine columns rather than eight: the “extra” end column is needed to calculate the 8th column of the difference field.

2.4.4 Generating the hash

If the n:th cell of the difference field has a positive value, hash bit n is set to 1; otherwise it is set to 0. A 64-bit hash value is returned.

This 64-bit binary vector can then be represented in any convenient form, commonly as a hexadecimal string.
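
A minimal sketch of the difference field and the bit assignment, assuming the image has already been resized and gray scaled into eight rows of nine values. The function name and the row-by-row bit order are illustrative assumptions.

```python
def difference_hash(gray):
    """Compute a 64-bit dHash from 8 rows of 9 grayscale values."""
    if len(gray) != 8 or any(len(row) != 9 for row in gray):
        raise ValueError("expected 8 rows of 9 grayscale values")
    bits = 0
    for row in gray:
        for j in range(8):
            # df[i][j] = img[i][j] - img[i][j+1]
            diff = row[j] - row[j + 1]
            # A positive difference sets the bit, anything else clears it.
            bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits
```

Because only the sign of each difference matters, adding the same constant to every pixel leaves the hash unchanged.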

2.5 pHash with DCT

This version of pHash uses the discrete cosine transformation (DCT)[5].

2.5.1 Gray scaling

The image is converted to a grayscale image using its luminance.

2.5.2 Image smoothing

A smoothing filter is applied to the image.

2.5.3 Image Resizing

The image is resized to an n by n picture, where n=32 is used in the implementation.

2.5.4 Discrete cosine transformation

A discrete cosine transformation[6] is performed on the image by generating a DCT matrix, multiplying it with the image, and then multiplying the resulting matrix with the transpose of the DCT matrix.

The 64 DCT coefficients with the lowest frequencies, excluding the very lowest ones in the first row and column, are then used to make the hash, i.e. dctImage.crop(1,1,8,8).unroll('x')

2.5.5 Calculate median value

The median value of the selected DCT coefficients is calculated.

2.5.6 Use median value to generate hash

The hash is generated by going through the selected DCT coefficients and setting the corresponding hash bit if the DCT value is higher than the calculated median value. A 64-bit value is returned.
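
The DCT, crop and median steps can be sketched in plain Python as below. This is not the pHash source: it assumes an orthonormal DCT-II matrix, takes an already gray scaled, smoothed and resized square image as a list of rows, and reads the cropped coefficients row by row. All function names are illustrative.

```python
import math
from statistics import median


def dct_matrix(n):
    """Build an n x n orthonormal DCT-II matrix (assumed form)."""
    rows = [[1.0 / math.sqrt(n)] * n]
    scale = math.sqrt(2.0 / n)
    for i in range(1, n):
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * i / (2 * n))
                     for j in range(n)])
    return rows


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def dct_hash(gray):
    """Compute a 64-bit hash from a square grayscale image (e.g. 32x32)."""
    d = dct_matrix(len(gray))
    # DCT matrix times image, times the transposed DCT matrix.
    coeffs = matmul(matmul(d, gray), transpose(d))
    # Rows and columns 1..8, mirroring crop(1,1,8,8): skips the first row/column.
    low = [coeffs[i][j] for i in range(1, 9) for j in range(1, 9)]
    med = median(low)
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > med else 0)
    return bits
```

Since each bit is set only when a coefficient lies strictly above the median of the 64 selected values, at most 32 bits of the hash can be set.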


2.6 pHash with Radial Variance

This version of pHash uses radial variance[7].

2.6.1 Gray scaling

The image is converted to gray scale.

2.6.2 Image blurring

The image is blurred with a strength given by a parameter sigma, commonly 1.

2.6.3 Gamma correction

The image is gamma corrected with a strength given by a parameter gamma, commonly 1.

2.6.4 Radon Projections

The image is used as the basis for generating a Radon map, sometimes also called a sinogram.

2.6.5 Feature extraction

The features are extracted from the sinogram.

2.6.6 Discrete Cosine transformation

A DCT[6] calculation (much the same as in 2.5.4) is performed on the features from the sinogram and returns a sequence of 40 DCT coefficients.

2.6.7 Peak of Cross Correlation[8]

Instead of using the Hamming distance to compare hash “closeness”, the radial variance hash uses cross correlation. Simply put, the cross correlation peaks where the images “match”, and the peaks are summed; if the sum is above a threshold, the images are declared to be a match.
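
As a rough illustration of comparing two coefficient sequences by cross correlation, the sketch below takes one possible reading of the description above: it computes the normalized circular cross correlation for every shift and uses the single highest peak. Whether the pHash implementation sums several peaks or uses only the maximum is not settled by the text, so this choice, the function names and the threshold value of 0.9 are all assumptions.

```python
import math


def cross_correlation_peak(x, y):
    """Highest normalized circular cross correlation between x and y."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0:
        return 0.0  # a constant sequence carries no shape information
    # Try every circular shift of y and keep the best correlation.
    return max(
        sum(dx[i] * dy[(i + shift) % n] for i in range(n)) / denom
        for shift in range(n)
    )


def is_match(x, y, threshold=0.9):
    """Declare a match when the peak reaches the (hypothetical) threshold."""
    return cross_correlation_peak(x, y) >= threshold
```

A sequence compared with itself, or with a circularly shifted copy of itself, gives a peak of 1, while unrelated sequences give lower values.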

2.7 pHash with Marr–Hildreth Operator

This version of pHash uses the Marr-Hildreth operator[9].

2.7.1 Image Resizing

The image is resized to 512 by 512 pixels.

2.7.2 Gray scaling

The image is converted to gray scale.

2.7.3 Image Blurring

The image is blurred with a strength of one.

2.7.4 Correlation of the Marr-Hildreth kernel

The image is correlated with the Marr-Hildreth Laplacian of Gaussian kernel[10].

2.7.5 Image normalization

The image is normalized to values between 0 and 1.

2.7.6 Hash generation

The image hash is generated as a 72-element array of bytes. A normalized Hamming distance is used to calculate the distance to another image.


2.8 Searching hashes

The feasibility of matching an image depends not only on the hash algorithm used but also very much on the size of the dataset. The size of the dataset in turn depends greatly on how it is constructed: a dataset built only from images in scientific reports, for example, would be significantly smaller than one built by crawling the web for all possible images. Either way, there are two primary ways of searching the dataset, listed below.

2.8.1 Logarithmic search

Logarithmic search time is very much desired for large datasets, since the search time only increases logarithmically with the size of the dataset, making it possible to search a great number of images for matches. The obvious drawback is that only zero-distance hashes will be matched, so matches with a very small “closeness” distance are missed.

2.8.2 Linear search

Linear search time is a consequence of how distance is measured: a Hamming distance must be calculated from the input image hash to every other image hash in the dataset. Calculating the Hamming distance is fairly quick, but one still has to go through the entire dataset, which makes the approach infeasible for large datasets. The upside is that even modified images can be matched within a distance threshold, making plagiarism harder.
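
The two search strategies can be contrasted with a short sketch. It assumes the hashes are stored as integers, uses a sorted list with binary search for the logarithmic case, and scans the whole list for the threshold case; the function names are illustrative.

```python
from bisect import bisect_left


def exact_match(sorted_hashes, query):
    """Logarithmic search: binary search for a zero-distance match."""
    i = bisect_left(sorted_hashes, query)
    return i < len(sorted_hashes) and sorted_hashes[i] == query


def threshold_matches(hashes, query, threshold):
    """Linear search: keep every hash within the Hamming distance threshold."""
    return [h for h in hashes if bin(h ^ query).count("1") <= threshold]


if __name__ == "__main__":
    dataset = sorted([0b0000, 0b1010, 0b1111])
    print(exact_match(dataset, 0b1011))            # False: no identical hash
    print(threshold_matches(dataset, 0b1011, 1))   # [10, 15]: two near matches
```

The example shows the trade-off described above: the binary search misses a query that differs from two stored hashes by a single bit, while the linear scan finds both.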

2.9 ImageHash 0.3 for python

There is a perceptual hash library for Python[11] that includes algorithms for average hashing, differential hashing and pHashing with discrete cosine transformation, based on the posts[3, 4]. While its implementations of aHash and dHash seem to be decent, its DCT pHash is not correct, and as such it will not be used. Instead, the pHash library written in C++ by the original algorithm authors will be used for all pHash calculations.

2.10 pHash for C++

The original implementation of the pHash algorithms[12] (discrete cosine transformation, radial variance and Marr-Hildreth operator) is available in C++ with bindings for Java and PHP. Since this implementation is the authority on pHash, it will be used for all pHash results.


Chapter 3

Method

Testing of perceptual hashing to detect visual plagiarism will be carried out in the following steps.

3.1 Image selection

A subset of 567 images was selected from the Wikimedia Commons “Quality images” category[13]. Due to bandwidth and space constraints, a smaller version of each image was used (600 pixels wide).

3.2 Hash selection

Three of the hash algorithms were chosen: aHash serves as the simplest example of hashing, dHash is a slightly more complex algorithm, and pHash is a well-established implementation using more advanced methods such as discrete cosine transformations.

Additionally, all three have the same hash size and hash comparison method (Hamming distance), unlike both the Marr-Hildreth operator and radial variance hashes.

3.3 Rudimentary image manipulation

If you know the algorithms being used, you can construct an attack image with the express purpose of fooling the algorithms in question, just as you can in other media. Since there is no real way around this, we instead focus on image manipulation that would be realistic to perform with common photo editing software, Photoshop CC 14 in this case.

3.3.1 No change

The image is not changed; it is simply saved again. It will still be recompressed, so it provides a good baseline for what the JPEG compression does in and of itself.


3.3.2 200% Scaling

The image is resized to twice its original size.

3.3.3 Color correction

The color was changed only slightly, so as not to make it obvious that manipulation had taken place: a -15 color correction was applied to all three color channels.

3.3.4 Contrast adjustment

Contrast was increased by 40 percent.

3.3.5 Black and white filter

The image was gray scaled using a black and white filter.

3.4 One to One hash calculation

The before and after hashes of each image are compared and their distance calculated, in order to see the effect of each manipulation on the hash values and distances. Zero distances are counted and the average distance is recorded. False negatives, i.e. cases where the image is the same but the algorithm reports a distance over the threshold (8 for all tests), are also counted. (The number of true positives can be derived from this number, since all images that are not false negatives are true positives in this case, i.e. 567 minus the number of false negatives.)

3.5 One to Many hash calculation

In order to assess whether or not an algorithm is suitable, we also need to know the number of false positives, where images that are different are reported to be the same. This is done by taking each original image and comparing it to every other image in the particular dataset, resulting in about 160000 (n*n/2) comparisons in each test.
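
The counting in 3.4 and 3.5 can be sketched as follows, assuming the hashes are integers compared by Hamming distance. Treating a distance at or below the threshold as a reported match is an assumption, and the function names are illustrative rather than taken from the test scripts used for this report.

```python
from itertools import combinations


def hamming(x, y):
    return bin(x ^ y).count("1")


def count_zero_distances(originals, modified):
    """One to one: pairs whose before and after hashes are identical."""
    return sum(1 for o, m in zip(originals, modified) if hamming(o, m) == 0)


def count_false_negatives(originals, modified, threshold=8):
    """One to one: same image, but the distance is over the threshold."""
    return sum(1 for o, m in zip(originals, modified) if hamming(o, m) > threshold)


def count_false_positives(originals, threshold=8):
    """One to many: different images reported as the same."""
    return sum(1 for a, b in combinations(originals, 2) if hamming(a, b) <= threshold)


def pair_count(n):
    """Number of distinct image pairs among n images."""
    return n * (n - 1) // 2
```

For the 567 images used here, pair_count(567) gives 160461 distinct pairs, which is the "about 160000" comparisons mentioned above.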


Chapter 4

Results

Below are the results for the different hash algorithms.

The average distance is of interest because if we do a full dataset search (Hamming distance to all data points) we want the number to be as low as possible, i.e. on average have a good margin to the threshold.

The number of zero distances is of interest because, given a good percentage of exact matches, we can do a search in log(n) time.

aHash results

                           No change   Color correction   Contrast adjustment   Grayscale filtering   200% scaling
Average distance           0.0406      0.0406             0.666                 1.444                 0.136
Number of zero distances   549         549                312                   229                   503
False negatives            0           0                  0                     7                     0
False positives            867         867                800                   916                   857

dHash results (columns: No change, Color correction, Contrast adjustment, Grayscale filtering, 200% scaling)

pHash results (columns: No change, Color correction, Contrast adjustment, Grayscale filtering, 200% scaling)


Chapter 5

Discussion & Conclusion

Overall the results are positive. For “no change” and “color correction”, the number of zero-distance matches is decent, while all algorithms in all tests have a reasonable match distance for a threshold search (i.e. mostly below 8-10, which is a common threshold).

There are things about the algorithms not shown in the results, namely false matching, which is when two different images are matched as the same. This is especially common with the simpler algorithms, which do not use discrete cosine transformations.

The results are not very encouraging for log(n) searches in most cases, given that only roughly half of the images get a zero-distance match.

dHash fares well, as might be expected, seeing that it is a differential function and the change from pixel to pixel should be fairly constant under the image manipulations chosen here. For example, the color correction is largely nullified by the fact that dHash sets its bits from the differences between neighbouring pixel values rather than from the values themselves.

One of the more surprising results is the poor performance of both aHash and dHash in the grayscale test, given that both algorithms convert to grayscale before computation. This might be because Photoshop does not weigh all colors equally when gray scaling (i.e. it does color analysis before conversion), or because the ImageHash library does not convert explicitly to grayscale but simply weights the different color channels together.

It is also quite surprising that pHash struggles in the 200% scaling test, doing worse than both aHash and dHash. I suspect this might be because the bilinear interpolation done by the image processor when scaling the image up changes the DCT coefficients of the image, in combination with the DCT compression that JPEG performs when the image is resaved.

Given the results, a hybrid approach might be appropriate: constructing a hash with a higher tolerance for changes to provide a zero-distance match, and then applying a linear search on that subset of the data with more complex and precise algorithms such as pHash.
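
One way such a hybrid search could be organized is sketched below. It is only an illustration of the idea: the coarse and fine hashes are assumed to be precomputed integers, a dictionary stands in for the exact-match lookup, and every name as well as the threshold of 8 is hypothetical.

```python
from collections import defaultdict


def build_index(images):
    """Group (name, coarse_hash, fine_hash) records by their coarse hash."""
    index = defaultdict(list)
    for name, coarse, fine in images:
        index[coarse].append((name, fine))
    return index


def hybrid_search(index, coarse, fine, threshold=8):
    """Exact lookup on the tolerant hash, then a linear scan of that subset."""
    candidates = index.get(coarse, [])
    return [name for name, stored in candidates
            if bin(stored ^ fine).count("1") <= threshold]


if __name__ == "__main__":
    index = build_index([
        ("a.jpg", 0xAA, 0b11110000),
        ("b.jpg", 0xAA, 0b00001111),
        ("c.jpg", 0xBB, 0b11110001),
    ])
    # Only images sharing the coarse hash 0xAA are compared in detail.
    print(hybrid_search(index, 0xAA, 0b11110001, threshold=2))  # ['a.jpg']
```

The linear scan then only touches the images that share the query's coarse hash, rather than the whole dataset.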


Bibliography

[1] “Hamming distance in coding theory.” 2010. Retrieved 20 Apr. 2015. http://www.maths.manchester.ac.uk/ pas/code/notes/part2.pdf

[2] Normalized Hamming distance. pHash 0.96 source code implementation, rows 955-971.

[3] “Looks Like It.” Dr. Neal Krawetz, 2011. Retrieved 20 Apr. 2015. http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

[4] “Kind of Like That.” Dr. Neal Krawetz, 2013. Retrieved 20 Apr. 2015. http://www.hackerfactor.com/blog/?/archives/529-Kind-of-Like-That.html

[5] pHash DCT image hash. pHash 0.96 source code implementation, rows 356-401.

[6] Discrete cosine transformation. Mark Nixon, Alberto S Aguado. “Feature Extraction & Image Processing for Computer Vision, 3rd Edition.” Pages 68-70, 2012.

[7] pHash 0.96 source code implementation, rows 258-296.

[8] Bracewell, R. “Pentagram Notation for Cross Correlation.” Pages 46 and 243, 1965.

[9] pHash Marr–Hildreth image hash. pHash 0.96 source code implementation, rows 847-902.

[10] Mark Nixon, Alberto S Aguado. “Feature Extraction & Image Processing for Computer Vision, 3rd Edition.” Pages 165-170.


[11] ImageHash 0.3 for Python. Johannes Buchner. https://github.com/JohannesBuchner/imagehash

[12] pHash 0.96. Evan Klinger & David Starkweather. http://www.phash.org/releases/pHash-0.9.6.tar.gz

[13] Wikimedia Commons - Quality images. Multiple authors. http://commons.wikimedia.org/wiki/Category:Quality_images Retrieved 20 Apr. 2015.


www.kth.se
