The same link explains how these features were extracted, quoting the cited article "Image-based recommendations on styles and substitutes":
Features are calculated from the original images using the Caffe deep learning framework [11]. In particular, we used a Caffe reference model with 5 convolutional layers followed by 3 fully-connected layers, which has been pre-trained on 1.2 million ImageNet (ILSVRC2010) images. We use the output of FC7, the second fully-connected layer, which results in a feature vector of length F = 4096.
The reference model mentioned there is the BAIR/BVLC Reference CaffeNet from the Caffe Model Zoo, which is a slightly modified version of AlexNet.
Since the model was trained on ImageNet, which contains a wide variety of photographs spanning 1,000 categories, retrieving the neural codes of one of the layers (obtained simply by forward propagation) gives you visual features that are a fair representation of the image, even though the network was not trained for Amazon's specific tasks (such as product recommendation). What these values actually mean is not all that tangible: they are the outcome of multiple 2D convolutions, plus normalization and regularization functions, whose parameters were adjusted specifically for classifying ImageNet photographs.
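As an illustration, here is a minimal pycaffe sketch of that forward propagation. The file paths and the input image name are placeholders; adjust them to wherever your Caffe checkout keeps the reference model's deploy definition, weights, and ImageNet mean file:

```python
import numpy as np
import caffe

# Placeholder paths to the BAIR/BVLC Reference CaffeNet files.
model_def = 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'
mean_file = 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

net = caffe.Net(model_def, model_weights, caffe.TEST)
net.blobs['data'].reshape(1, 3, 227, 227)  # a single 227x227 RGB image

# Standard CaffeNet preprocessing: channels first, [0,255] scale,
# BGR channel order, per-channel mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_mean('data', np.load(mean_file).mean(1).mean(1))

image = caffe.io.load_image('some_product_image.jpg')  # placeholder image
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# One forward pass; the neural code is the FC7 activation.
net.forward()
features = net.blobs['fc7'].data[0].copy()  # vector of length 4096
```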
The FC7 layer has a rectified linear unit (ReLU) activation, which means the output values are all non-negative (and typically contain many exact zeros). And since it is a fully connected layer that follows several convolutions, there is no intuitive mapping between a feature index and a particular characteristic of the image. You can picture the network as a highly complex function that yields a high-level representation of the image in the form of a vector of numbers.
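Continuing from the sketch above, both properties are easy to check on the extracted vector:

```python
# FC7 sits behind a ReLU, so every component is >= 0 and many are
# exactly zero (the representation is sparse).
print((features >= 0).all())   # True
print((features == 0).mean())  # fraction of zeroed components
```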
See also the paper Neural Codes for Image Retrieval, where the authors extract features from a pre-trained neural network in this same fashion and use them for image retrieval in a different image domain.
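As a rough sketch of that idea (not the paper's exact pipeline): L2-normalize the neural codes and rank a database of images by cosine similarity to a query. The arrays below are random placeholders standing in for FC7 codes extracted as shown earlier:

```python
import numpy as np

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

# Placeholder data: in practice these would be FC7 codes extracted
# from a database of images and from the query image.
database_codes = l2_normalize(np.random.rand(1000, 4096).astype(np.float32))
query_code = l2_normalize(np.random.rand(4096).astype(np.float32))

# After L2 normalization, the dot product is the cosine similarity.
similarities = database_codes @ query_code
top_matches = np.argsort(-similarities)[:10]  # 10 most similar images
```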