Perceptual Hashing of Images

BY MATTHEW LOWRY

The D2D CRC’s Apostle program is developing tools to help search and analyse various data sources, including multimedia data. Our footprint application focusses on online activity. When the application generates a summary of online activity, part of the summary is a display of images. Ideally, this display will show a collection of images that give a broad overview of the nature of the images being posted and the range of concepts and objects contained in them. Algorithms for achieving this are a matter for ongoing research and a topic for the CRC's “Picturing Knowledge” research stream, led by Associate Professor Lexing Xie at Australian National University.

As a temporary measure, we have included a placeholder algorithm that simply displays a filtered list of the recent images within the online activity the application is analysing. The filtering is done using a simple perceptual hashing method to remove duplicates and improve the visual diversity of the displayed images. In this post we will give a brief introduction to the concept of perceptual hashing of images, and demonstrate the simple technique we used.

Perceptual Hashing

The term hashing in its general sense refers to processing data to generate some kind of "fingerprint" that represents the data. Often the term is used to refer to the specific area of cryptographic hash functions. These offer strong guarantees: the hash value produced for data is well distributed and "tamper evident" (changing the data - even a single bit of it - causes a large change in the hash value produced), and the function is resistant to collision attacks (someone who knows the hash value of a piece of data cannot deliberately generate different, "fake" data that will have the same hash value).

However, there are other types of hashing where different algorithmic properties are desired. Perceptual hashing aims to generate hash values from multimedia (images, audio, movies) where a change that a person would perceive as small will result in a similarly small change to the hash value. The classic use case is copyright infringement detection - for example, detecting whether a user of an image-sharing website is posting a known image that is under copyright protection. If this were attempted with a traditional cryptographic hash function, a user could defeat the measure by making an imperceptible change to the colour of a single pixel in the image - the cryptographic hash function would (by design) generate a very different hash value. In contrast, a perceptual hash function would produce the same (or a very similar) hash value.
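To make the contrast concrete, here is a small illustration using Python's standard hashlib module (the input bytes are just a stand-in for image data):

    import hashlib

    data = bytearray(b"some image data")
    print(hashlib.sha256(data).hexdigest())

    data[0] ^= 1  # flip a single bit of the input
    print(hashlib.sha256(data).hexdigest())  # a completely different digest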

Perceptual hashing is closely related to the concept of locality-sensitive hashing, but with perceptual hashing the concern is a person's subjective perception rather than a formalised definition in terms of distances within a metric space.

Mean DCT Hash

A wide variety of techniques for perceptual hashing of images have been proposed in academic literature. A popular approach is to use the Discrete Cosine Transform (DCT) and base the hash value on the spectral decomposition of the image. A simple but effective way to apply the DCT is called "Mean Hash" or "Average Hash".

In simple terms, the algorithm is as follows (a code sketch follows the list):


  1. Scale the image to 32x32 pixels.
  2. If the image is in colour, convert it to grayscale.
  3. Apply the DCT (producing a 32x32 matrix of DCT coefficients).
  4. Take the upper-left 8x8 submatrix of the DCT coefficients.
  5. Ignoring the first matrix element:
    1. Compute the mean of the remaining 63 element values.
    2. Map each element to 1 if it is greater than or equal to the mean, or to 0 if it is less than the mean.
    3. Coalesce the mapped elements into a linear bit string.
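
To make the steps concrete, here is a minimal Python sketch, assuming the Pillow and SciPy libraries (any image library and DCT implementation would serve equally well):

    import numpy as np
    from PIL import Image
    from scipy.fftpack import dct

    def mean_dct_hash(path):
        # Steps 1 and 2: scale to 32x32 and convert to grayscale.
        img = Image.open(path).convert("L").resize((32, 32))
        pixels = np.asarray(img, dtype=np.float64)

        # Step 3: apply the type II DCT along both axes, producing a
        # 32x32 matrix of DCT coefficients.
        coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")

        # Step 4: take the upper-left 8x8 submatrix (lowest frequencies).
        low = coeffs[:8, :8].flatten()

        # Step 5: ignore the first element (the DC coefficient), compute
        # the mean of the remaining 63 values, and threshold against it.
        ac = low[1:]
        mean = ac.mean()
        return "".join("1" if v >= mean else "0" for v in ac)

Note that the exact bits produced depend on implementation details such as the resampling filter used when scaling down and the DCT normalisation, so both sides of a comparison must use the same implementation.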

The resulting length-63 bit string is the hash value for the image. To compare hash values, use the Hamming distance between the bit strings generated from the images.
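
A minimal sketch of that comparison, for the bit strings produced by the function above:

    def hamming_distance(hash_a, hash_b):
        # Count the positions at which the two bit strings differ.
        return sum(a != b for a, b in zip(hash_a, hash_b))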

In theory the image can be scaled down to any width or height, since the DCT (strictly speaking, the "type II" DCT) can be applied to a real-valued matrix of any size. In practice, however, a 32x32 pixel downsizing is convenient for many reasons. Notably, there are computational tricks that can be used to speed up the DCT computation when the matrix is square and its size is a power of two.

The motive for taking an upper-left submatrix of the DCT coefficients is to obtain a hash value that depends on the properties of the image at lower (spatial) frequencies and will not be altered by high-frequency changes to an image.

Using an 8x8 submatrix in particular (rather than some other size) generally works well, assuming the DCT was applied to a 32x32 matrix. If the DCT is applied to a larger or smaller matrix, the size of the submatrix used to select the lower-frequency aspects of the image would need to be adjusted. The 8x8 submatrix size is also convenient because it generates a hash value that can be packed into a 64-bit integer value.
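
As an illustration (function names are ours), the packing and an integer-based Hamming distance might look like:

    def pack_hash(bits):
        # Interpret the '0'/'1' string as a binary number; 63 bits fit
        # comfortably in a 64-bit integer.
        return int(bits, 2)

    def hamming_distance_packed(h1, h2):
        # XOR leaves a 1 bit wherever the two hashes differ.
        return bin(h1 ^ h2).count("1")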

The motive for ignoring the first element of the submatrix is that this element contains the "DC coefficient" - a value that represents the average pixel value of the whole image. This coefficient is typically much higher than all the other coefficients and distorts the mean value unless it is ignored.
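
With the orthonormal DCT used in the sketch above this is easy to verify numerically: for an NxN matrix, the [0, 0] coefficient works out to N times the mean pixel value.

    import numpy as np
    from scipy.fftpack import dct

    pixels = np.random.default_rng(0).uniform(0, 255, size=(32, 32))
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")

    # For a 32x32 matrix the DC coefficient equals 32 * mean -- on the
    # order of 4000 here, dwarfing the other (AC) coefficients.
    print(coeffs[0, 0], 32 * pixels.mean())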

Example

Here is an example of using the above algorithm.

Consider the following image:

[Image: Boaty McBoatface, boaty.jpg (620x348). Credit: National Oceanography Centre/PA]

Pass this through the algorithm and the resulting hash value is:
boaty.jpg = 010110001100010011100101001011001100011011110011110011000111001

Now let's compare the hash value with some other image:

[Image: Drumpf, drumpf.jpg (640x470). Credit: Crooks and Liars (with apologies to Toot the Tiny Tugboat)]

drumpf.jpg = 111000000110000101011001000101110010000001000101010110101000101

Now we can evaluate the distance between these images:
d(boaty.jpg, drumpf.jpg) = 37

Note that the expected Hamming distance between two random length-63 bit strings is 31.5 (each of the 63 bit positions differs with probability 1/2, giving 63/2 = 31.5); so the perceptual hashes have correctly inferred that these images are not alike.

Now let’s try a different image:

[Image: Boaty McBoatface again, boaty2.jpg (1100x619). Credit: CNN]

boaty2.jpg = 010110101100101100101101001011001101011111010011110010000001011

d(boaty.jpg, boaty2.jpg) = 15

This image is, to the perception of a person, similar to the first. Although the lighting and colour balance are different, the camera view is wider, and the image resolution is significantly higher, the perceptual hash has recognised its closeness to the first image.

Lastly, let's compare the first image to this one:

[Image: Boaty McAwesomeface, boaty_mcawesomeface.jpg (600x300)]

boaty_mcawesomeface.jpg = 010110001100010011100101001011001100011011110010110011000111001

d(boaty.jpg, boaty_mcawesomeface.jpg) = 1

Here the first image has been resized by a small amount and "defaced" (ha ha hah sorry). The perceptual hash has concluded that it is basically the same image.

Conclusion

The Mean DCT Hash algorithm is an efficient and computationally cheap method for generating hash values from images, such that the hash value captures the low-frequency structure of the image. Images with only small differences in their low-frequency structure will receive hash values separated by a small Hamming distance. These hash values can be used to filter the collection of images shown in a summary of online activity, eliminating duplicates and improving the visual diversity of the images shown.
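
As a sketch of that filtering step (mean_dct_hash and hamming_distance refer to the earlier sketches, and the threshold of 10 is an illustrative choice, not a tuned value), one might keep an image only if it is sufficiently far from everything already kept:

    def filter_near_duplicates(image_paths, threshold=10):
        # Keep an image only if its hash is more than `threshold` bits
        # away from the hash of every image already kept.
        kept = []
        for path in image_paths:
            h = mean_dct_hash(path)
            if all(hamming_distance(h, other) > threshold for _, other in kept):
                kept.append((path, h))
        return [path for path, _ in kept]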