Variance computation

pcl · Post by **pcl** » 2016-11-16T09:09:18-07:00

I'm new here and although I've used the search button to see whether the issue described below has been discussed in the past, I've not found anything. Apologies if it's an old one.

The identify utility can be persuaded to give statistical information about the channels in an image, specifically the minimum, maximum, mean and standard deviation. I'm part way through writing some code which also computes the same statistics. The first three agree exactly with what identify computes. The standard deviation not only differs, I know exactly how it differs.

According to all good stats books the S.D. of N values is defined as (I hope the pseudo-\Tex notation is clear)

sqrt ((1/N-1) * Sum_{i=1}^N (x_i - mean)^2)

when the mean is estimated from the data. If the mean is known a priori, then, and only then, the denominator (N-1) should be replaced by N.

My code uses N-1. If it is changed to N the result agrees with identify's.

My claim is that the mean of an arbitrary image is not an a priori known quantity and must be estimated from the data. Consequently, I assert there is a bug in the ImageMagic code.

Has this been discussed before? How should I submit a bug report?

Paul

Post by **fmw42** » 2016-11-16T09:50:41-07:00

See https://en.wikipedia.org/wiki/Variance. Using n and not n-1 (or n+1 or n-1.5) is the population variance. It is also the variance of a set of n equally likely values. With images with large numbers of pixels the difference between n and n-1, is probably not significant.

If you want the unbiased std using the value (n-1), then just take

unbiased_std = sqrt(im_std*n/(n-1))

pcl · Post by **pcl** » 2016-11-16T10:06:24-07:00

fmw42 wrote:See https://en.wikipedia.org/wiki/Variance. Using n and not n-1 (or n+1 or n-1.5) is the population variance. It is also the variance of a set of n equally likely values. With images with large numbers of pixels the difference between n and n-1, is probably not significant.

If you want the unbiased std using the value (n-1), then just take

unbiased_std = sqrt(im_std*n/(n-1))

Thanks.

I can see that this will be a tar pit. The same Wiki article also says (emphasis mine)

Firstly, if the omniscient mean is unknown (and is computed as the sample mean), then the sample variance is a biased estimator: it underestimates the variance by a factor of (n − 1) / n; correcting by this factor (dividing by n − 1 instead of n) is called Bessel's correction. The resulting estimator is unbiased, and is called the (corrected) sample variance or unbiased sample variance. For example, when n = 1 the variance of a single observation about the sample mean (itself) is obviously zero regardless of the population variance. If the mean is determined in some other way than from the same samples used to estimate the variance then this bias does not arise and the variance can safely be estimated as that of the samples about the (independently known) mean.

I completely agree that the difference in practice will usually be small and that if the difference is significant you should be very careful but, nonetheless, I suggest that an unbiased estimate should be computed by default.

Post by **fmw42** » 2016-11-16T10:27:51-07:00

I understand your concern. But I will leave that to the IM developers.

Post by **snibgo** » 2016-11-16T10:43:59-07:00

I'll just comment that I often use "-statistic StandardDeviation 3x3", which replaces each pixel by the SD of the 3x3 pixel window. With only 9 pixels, I suppose the difference is about 10%.

I'm not judging which should be used: N or N-1.

Post by **fmw42** » 2016-11-16T11:20:52-07:00

One possible solution would be a new define to set the std to unbiased. -define identify:std=unbiased If specified then n-1 would be used. If not specified, then n (backward compatible) will be used.

Post by **fmw42** » 2016-11-16T11:27:30-07:00

I would add that it raises other issues, such as what to do about -statistic and about fx:standard_deviation and %[standard_deviation]. So there would need to be a -define std=unbiased that would work with both convert/magick and identify.

Post by **magick** » 2016-11-19T19:54:34-07:00

PCL is correct. The denominator should be N-1 when dealing with samples to ensure an unbiased estimator for the standard deviation, as such it is a bug and should be corrected without the need for a define. As Fred mentioned, the correction would need to apply wherever the standard deviation of the samples are computed.

Post by **fmw42** » 2016-11-19T20:29:28-07:00

One problem with doing that. A 1 pixel image will have an undefined std, due to a divide by 0 (1-1=0). So you will need to trap for that special case and report NAN or inf or some other value. This is fine for identify and %[standard_deviation], but may cause issues with -statistic and -fx.

pcl · Post by **pcl** » 2016-11-23T02:34:32-07:00

fmw42 wrote:One problem with doing that. A 1 pixel image will have an undefined std, due to a divide by 0 (1-1=0). So you will need to trap for that special case and report NAN or inf or some other value. This is fine for identify and %[standard_deviation], but may cause issues with -statistic and -fx.

NO!

The standard deviation of a single point is identically zero. Think about it: there are no deviating points and so no deviation. See also the portion of the Wikipedia article I quoted above.

If you would like a more mathematically rigorous explanation, let N be a continuous real variable. Sums in the discrete case become integrals in the continuous counterpart; evaluate the limit of the integral expression for the S.D. as N tends to one from above.

pcl · Post by **pcl** » 2016-11-23T10:08:52-07:00

pcl wrote:The standard deviation of a single point is identically zero. Think about it: there are no deviating points and so no deviation.

I could have explained that rather better as the reasoning goes to the heart of this topic. There are no deviating points only if you've already used the sole point to determine the mean.

A thought experiment may help. I have a large bag containing one trillion numbers. The numbers are bounded above and below by (possibly different) finite values but I'm not going to tell you what the limits are to avoid tempting you into leaping to unjustified conclusions. The constraints given so far are to avoid problems with infinite sets. Neither am I going to tell you whether the numbers are all the same, all different or contain any particular type of pattern, again to stop you leaping to unjustified conclusions.

I invite you to draw a number from the bag. It has the value 42. What is your best estimate for the mean of the numbers in the bag? You would (at, least I hope you would) answer 42. What is your best estimate for the value of the standard deviation from the mean of the numbers in the bag? You have no idea, so would answer 0.

After you've done that, I then tell you that I've already looked at all the numbers in the bag and that they sum to zero. Therefore the value of the mean is undoubtedly zero and your previously made best estimate differs from it. However, what you can now do is estimate the standard deviation from your sample. It will be a poor estimate to be sure, though you don't yet know how inaccurate it is.

Is that clearer? The N-1 in the denominator compensates you you having used 1 sample to determine the mean and only the remaining N-1 samples can be used to estimate the variance in that quantity. If, on the other hand, you know a priori that the mean is zero you can now use all N samples to estimate the variance.

HTH.

Post by **snibgo** » 2016-11-23T10:36:30-07:00

pcl wrote:What is your best estimate for the value of the standard deviation from the mean of the numbers in the bag? You have no idea, so would answer 0.

I'm no mathematician, so feel free to shoot me down in flames, but ...

Drawing one number from the bag (and knowing nothing else) tells us nothing about the SD of the entire population. Agreed. But that means there is no "best estimate", and the answer "zero" is no better than any other number.

It seems to me that the SD of a single number is zero, because there is no variance. Fair enough. But the estimated SD of a population, when we are given a single known sample, is indeterminate.

Thus, if we want to know the SD of a group size N, we divide by N. (N may be 1.) But if we want to estimate the SD of the total population from a sample, we should divide by (N-1), and this is zero when we have only a single sample, so we get the correct answer which is "we don't know".

Post by **fmw42** » 2016-11-23T11:33:12-07:00

My point was simply from a computational perspective. The code for doing the standard deviation does not know about this special case. It has a specific mathematical computation where it computes the numerator and divides by n-1. So if n=1, you have a divide by zero and in fact have 0/0, which is indeterminant. From a proper mathematica point of view you may be correct, but computationally using the specific formula, you get indeterminate. So the algorithm would have to trap out the special case whether it should be 0 or undetermined.

Post by **magick** » 2016-11-23T11:48:58-07:00

ImageMagick has a macro to properly handle divide by zero contributed by Nicolas Robidoux. Instead of dividing by 0, we multiply by 1/N where N is Infinitesimally small for N=0. We utilize this macro throughout the ImageMagick code base to prevent divide by zero exceptions or numerical instability.

Post by **fmw42** » 2016-11-23T12:19:36-07:00

My point is if pci is correct or if it is indeterminate. A special case may(?) need to be made for the single pixel case to get the proper mathematical result.

Basically it makes no sense to compute the std of one sample. From http://stackoverflow.com/questions/8023 ... ne-element

"Performing a calculation of the standard deviation over a population of size 1 makes absolutely no sense, which I think is where the confusion is coming from. If you know that your population contains only one element then finding out the standard deviation of that element is pointless, so generally you will see the standard deviation of a single element written as undefined."

and

"Standard deviation - which is a measure for the deviation of the actual value from the average of a given set - for a list of one element doesn't make any sense (you can set it to 0 if you want)."

So take your pick of what you want IM to specify, since it is really pointless to compute the unbiased std of 1 sample.

Legacy ImageMagick Discussions Archive

Variance computation

Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation

Re: Variance computation