The same link explains how these features were extracted, quoting the cited article "Image-based recommendations on styles and substitutes":
Features are calculated from the original images using the Caffe deep learning framework [11]. In particular, we used a Caffe reference model with 5 convolutional layers followed by 3 fully-connected layers, which has been pre-trained on 1.2 million ImageNet (ILSVRC2010) images. We use the output of FC7, the second fully-connected layer, which results in a feature vector of length F = 4096.
The reference model mentioned there is the BAIR/BVLC Reference CaffeNet from the Caffe Model Zoo, which is a slightly modified version of AlexNet.
Since the model was trained on ImageNet, which contains a wide variety of photographs spanning 1,000 categories, retrieving the neural codes of one of the layers (obtained simply by forward propagation) gives you visual features that are a fair representation of the image, even though the network was not trained for Amazon's specific tasks (such as product recommendation). What these values actually mean is not all that tangible: they are the outcome of multiple 2D convolutions, plus normalization and regularization functions, whose parameters were adjusted specifically for classifying ImageNet photographs.
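As an illustration, here is a minimal pycaffe sketch of that forward propagation. The file paths and the input image name are placeholders; adjust them to wherever your Caffe checkout keeps the reference model's deploy definition, weights, and ImageNet mean file:

```python
import numpy as np
import caffe

# Placeholder paths to the BAIR/BVLC Reference CaffeNet files.
model_def = 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'
mean_file = 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

net = caffe.Net(model_def, model_weights, caffe.TEST)
net.blobs['data'].reshape(1, 3, 227, 227)  # a single 227x227 RGB image

# Standard CaffeNet preprocessing: channels first, [0,255] scale,
# BGR channel order, per-channel mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_mean('data', np.load(mean_file).mean(1).mean(1))

image = caffe.io.load_image('some_product_image.jpg')  # placeholder image
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# One forward pass; the neural code is the FC7 activation.
net.forward()
features = net.blobs['fc7'].data[0].copy()  # vector of length 4096
```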
The FC7 layer has a rectified linear unit (ReLU) activation, which means the output values are all non-negative (and typically contain many exact zeros). And since it is a fully connected layer that follows several convolutions, there is no intuitive mapping between a feature index and a particular characteristic of the image. You can picture the network as a highly complex function that yields a high-level representation of the image in the form of a vector of numbers.
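Continuing from the sketch above, both properties are easy to check on the extracted vector:

```python
# FC7 sits behind a ReLU, so every component is >= 0 and many are
# exactly zero (the representation is sparse).
print((features >= 0).all())   # True
print((features == 0).mean())  # fraction of zeroed components
```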
See also the paper Neural Codes for Image Retrieval, where the authors extract features from a pre-trained neural network in this same fashion and use them for image retrieval in a different image domain.
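As a rough sketch of that idea (not the paper's exact pipeline): L2-normalize the neural codes and rank a database of images by cosine similarity to a query. The arrays below are random placeholders standing in for FC7 codes extracted as shown earlier:

```python
import numpy as np

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

# Placeholder data: in practice these would be FC7 codes extracted
# from a database of images and from the query image.
database_codes = l2_normalize(np.random.rand(1000, 4096).astype(np.float32))
query_code = l2_normalize(np.random.rand(4096).astype(np.float32))

# After L2 normalization, the dot product is the cosine similarity.
similarities = database_codes @ query_code
top_matches = np.argsort(-similarities)[:10]  # 10 most similar images
```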