I have thousands of webcam images in JPEG format which I want to analyze for cloud cover. Images with no cloud show the mountain and are relatively dark, while cloudy images have histograms skewed towards bright levels. The aim is to classify images as cloud or no cloud and then analyze the frequency of cloud by time of day, season, etc.
First the JPEGs are imported using the raster package. I intend to compile the images into a RasterStack and specify a window within each image for the analysis, e.g.
rasterlist <- list.files('test', full.names=TRUE) # list of files to make rasterstack
test_stack <- stack(rasterlist) # stack raster layers
#define the crop extent
cropbox <- c(200, 1280, 500, 900) # crop extent: xmin, xmax, ymin, ymax
test_cr <- crop(test_stack, cropbox) # crop the stack to the analysis window
I am looking for a robust dissimilarity measure to compare all of the images in the stack to a reference image using histograms. The more threads I read, the more suggested metrics I come across! This SO thread provides a good starting point.
The JPEG pixels take 256 grey levels when unbinned. My understanding is that the Kolmogorov-Smirnov distance is an appropriate measure in this case, or the chi-squared distance if the data are binned (which may be necessary to improve computation speed).
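For what it's worth, both of those can be computed in base R with no extra packages. A sketch on simulated 8-bit pixel samples (standing in for the pixel values of two cropped images):

```r
set.seed(42)
# simulated 8-bit pixel samples standing in for a clear (dark) and a
# cloudy (bright) image window
dark   <- pmin(pmax(round(rnorm(5000, mean = 80,  sd = 25)), 0), 255)
bright <- pmin(pmax(round(rnorm(5000, mean = 180, sd = 25)), 0), 255)

# Kolmogorov-Smirnov distance: the maximum gap between the two empirical CDFs
# (ks.test warns about ties on discrete data, but the statistic is still usable)
ks_dist <- suppressWarnings(ks.test(dark, bright)$statistic)

# chi-squared distance on binned, normalised histograms (32 buckets of width 8)
brks <- seq(0, 256, by = 8)
p <- hist(dark,   breaks = brks, plot = FALSE)$counts / length(dark)
q <- hist(bright, breaks = brks, plot = FALSE)$counts / length(bright)
nz <- (p + q) > 0
chisq_dist <- 0.5 * sum((p[nz] - q[nz])^2 / (p[nz] + q[nz]))
```

Both measures are bounded in [0, 1] on normalised data, which makes them easy to threshold for a cloud/no-cloud split.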
Several discussions on this topic recommend the Wasserstein or Earth Mover's Distance (EMD), but when I try this (package transport) on a pair of test images (each 1080x250 pixels) I get an error message (apparently my 16GB of RAM is not nearly enough!):
# first need to convert raster to a matrix then to object type pgrid
testmat <- as.matrix(test_cr2)
testmat2 <- as.matrix(test_cr3)
p1 <- pgrid(testmat)
p2 <- pgrid(testmat2)
wasserstein(p1,p2,p=1) # calculate EMD
Error: cannot allocate vector of size 271.6 Gb
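The pgrid route treats each image as a full 2-D spatial distribution, which is what blows up memory. For comparing grey-level histograms, the 1-D Wasserstein distance is enough and is cheap: with p = 1 it reduces to the area between the two cumulative distributions, so it can be computed straight from the binned counts (transport::wasserstein1d does the equivalent on raw pixel vectors without building a pgrid). A base-R sketch on toy histograms:

```r
# 1-D EMD (p = 1) between two histograms sharing bucket boundaries:
# the area between their cumulative distributions
emd1d <- function(counts1, counts2, binwidth) {
  p <- counts1 / sum(counts1)
  q <- counts2 / sum(counts2)
  sum(abs(cumsum(p) - cumsum(q))) * binwidth
}

# toy 32-bucket histograms standing in for a dark and a bright image
set.seed(1)
brks <- seq(0, 256, by = 8)
h_dark   <- hist(runif(1000, 40, 120),  breaks = brks, plot = FALSE)
h_bright <- hist(runif(1000, 140, 220), breaks = brks, plot = FALSE)
emd1d(h_dark$counts, h_bright$counts, binwidth = 8)  # roughly 100 grey levels
```

The result is in grey-level units (here about 100, the shift between the two uniform distributions), which is easy to interpret as "how far the brightness mass has moved".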
I have had more success with the histogram distance functions in the HistogramTools package, e.g.
# HistogramDistance requires 2 histogram objects with same bucket boundaries
h1 <- hist(test_cr2, breaks=seq(0,256,by=8))
h2 <- hist(test_cr3, breaks=seq(0,256,by=8))
minkowski.dist(h1, h2, 1) # 1 = manhattan dist
But I can find little information about the relative merits of the four distance metrics that HistogramTools provides:
The minkowski.dist function computes the Minkowski distance of order p between two histograms. p=1 is the Manhattan distance and p=2 is the Euclidean distance.
The intersect.dist function computes the intersection distance of two histograms, as defined in Swain and Ballard 1991, p15. If histograms h1 and h2 do not contain the same total of counts, then this metric will not be symmetric.
The kl.divergence function computes the Kullback-Leibler divergence between two histograms.
The jeffrey.divergence function computes the Jeffrey divergence between two histograms.
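For reference, the four formulas are short enough to write out in base R. This is my reading of them from the package documentation, so verify against HistogramTools itself before relying on the exact normalisations (especially for the intersection distance):

```r
# base-R versions of the four metrics (my reading of the HistogramTools
# formulas, on raw bucket counts a and b with shared boundaries)
minkowski_d <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

# intersection distance: normalised by the first histogram's total,
# hence asymmetric when the two totals differ
intersect_d <- function(a, b) 1 - sum(pmin(a, b)) / sum(a)

# KL divergence on normalised histograms; infinite if b has an empty
# bucket where a does not (one reason binning/smoothing matters)
kl_div <- function(a, b) {
  p <- a / sum(a); q <- b / sum(b)
  sum(ifelse(p > 0, p * log(p / q), 0))
}

# Jeffrey divergence: a symmetrised, always-finite variant of KL
jeffrey_div <- function(a, b) {
  p <- a / sum(a); q <- b / sum(b); m <- (p + q) / 2
  term <- function(x) ifelse(x > 0, x * log(x / m), 0)
  sum(term(p) + term(q))
}

a <- c(10, 40, 30, 20)
b <- c(5, 20, 45, 30)
minkowski_d(a, b, 1)  # Manhattan distance: 50
```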
Is the Minkowski distance suited to my purpose? And if so, which is better, Manhattan or Euclidean distance?
I can find no actual examples of any of these methods being implemented for histogram similarity in R.
Firstly, I'm after advice on which are the most useful methods that are relatively easy to implement in R for a large dataset (thousands of histograms).
Secondly, examples of any of these methods implemented in R for histogram comparison. (I can't find any relevant code online).