
I am a novice at data science, and I have noticed that some repositories state that the mean value and standard deviation of the MNIST dataset are 0.1307 and 0.3081.

I cannot figure out where these two numbers come from. Based on my understanding, the MNIST dataset has 60,000 pictures, and each of them has 28 * 28 = 784 features. How do I go from these feature vectors to a single mean and standard deviation?

In particular, shouldn't this be computed per label? For example, the digit 0 should have its own mean and standard deviation, and the digit 1 should also have its own mean and standard deviation.

rj487

3 Answers

  • mean: the mean of all pixel values in the dataset (60000 × 28 × 28 of them). This mean is calculated over the whole dataset, not per class; see the sketch after this list.
  • deviation: the standard deviation of all pixel values. The dataset is treated as a population rather than a sample.
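
For example, here is a minimal sketch that reproduces the two numbers, assuming torchvision is installed (its MNIST loader exposes the raw training images as a (60000, 28, 28) uint8 tensor):

import numpy as np
import torchvision

# Load the 60,000 training images; .data holds them as a (60000, 28, 28) uint8 tensor
train_set = torchvision.datasets.MNIST(root='.', train=True, download=True)

# Scale the pixel values to [0, 1] and pool all 60000 * 28 * 28 of them together
pixels = train_set.data.numpy().astype(np.float32) / 255.0

print(pixels.mean())  # ~0.1307
print(pixels.std())   # ~0.3081 (np.std is the population standard deviation)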

What are the uses of these values?

Mean and standard deviation are commonly used to standardize data, in this case the images. Standardized data has a mean close to 0 and a standard deviation close to 1. You can read more here.
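
In PyTorch, for instance, these two numbers are typically plugged into a normalization transform. A minimal sketch, assuming torchvision's transforms module:

import torchvision.transforms as transforms

# Convert each image to a float tensor in [0, 1], then subtract the dataset-wide
# mean and divide by the dataset-wide standard deviation quoted above
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

After this transform, the pixel values of the training set as a whole have mean close to 0 and standard deviation close to 1.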

Why standardize the data?

Standardization transforms your data so that it has zero mean and unit variance. According to Wikipedia,

In statistics, the standard score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured
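
Written as a formula, the standard score of a pixel value $x$ is

$$z = \frac{x - \mu}{\sigma},$$

where for MNIST $\mu \approx 0.1307$ and $\sigma \approx 0.3081$.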

Shubham Panchal

The repository is simply stating that, amongst all features and all examples, the mean value is 0.1307 and the standard deviation is 0.3081. You can get these values yourself, if you have the MNIST training set loaded into a NumPy array called mnist, by simply evaluating mnist.mean() and mnist.std().
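
A minimal sketch, assuming the Keras dataset loader is available (any loader that gives you the 60,000 training images as a uint8 array would do; the variable names are just for illustration):

import numpy as np
from tensorflow.keras.datasets import mnist as mnist_loader

# Load the 60,000 training images as a (60000, 28, 28) uint8 array
(x_train, _), _ = mnist_loader.load_data()

# Scale to [0, 1]; the two quoted numbers are just the array-wide statistics
mnist = x_train.astype(np.float32) / 255.0
print(mnist.mean())  # ~0.1307
print(mnist.std())   # ~0.3081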

timleathart

Hope it helps! Running the following on the 60,000 MNIST training images should give a mean of about 0.1307 and a standard deviation of about 0.3081:

import numpy as np

# self.train_x_set holds the training images with shape (N, 1, 28, 28),
# pixel values originally in [0, 255]
self.train_x_set = self.train_x_set / 255.   # scale pixels to [0, 1]

# mean over all pixels of all images (each image contributes equally)
mean = 0.
for x in self.train_x_set:
    mean += np.mean(x[0, :, :])
mean /= len(self.train_x_set)

# centre the data, then take the population standard deviation of the pixels
self.train_x_set -= mean
std = 0.
for x in self.train_x_set:
    std += np.mean(np.square(x[0, :, :]))
std = np.sqrt(std / len(self.train_x_set))
Kenn Wang