
I have the following code, where `x` is the input data, `w` is the weight matrix, and `by` and `bh` are the biases for the visible and hidden units, respectively.

import numpy
import theano
import theano.tensor as T

# Sample binary states for the visible (input) units
x_states = x > numpy.random.rand(training_examples, feats)

hid = T.nnet.sigmoid(T.dot(x_states, w) + bh)    # Activation of the hidden layer
hid_states = hid > numpy.random.rand(training_examples, nhidden)    # Sample binary hidden states

# Construct Theano expression graph
vis = T.nnet.sigmoid(T.dot(hid_states, w.T) + by)    # Reconstruction of the visible layer
vis_states = vis > numpy.random.rand(training_examples, feats)

hid2 = T.nnet.sigmoid(T.dot(vis_states, w) + bh)    # Second pass over the hidden layer
hid2_states = hid2 > numpy.random.rand(training_examples, nhidden)

xent = T.sum((x - vis)**2)    # Squared reconstruction error
parameters = [w, bh, by]      # All model parameters
cost = xent.mean()

pos_associations = T.dot(x.T, hid)      # Positive phase statistics
neg_associations = T.dot(vis.T, hid2)   # Negative phase statistics

wparameters = [w]
byparameters = [by]
bhparameters = [bh]

update = []
for wparam, byparam, bhparam in zip(wparameters, byparameters, bhparameters):
    update.append((wparam, wparam + tr_rate * (pos_associations - neg_associations)))
    update.append((byparam, byparam + tr_rate*(T.sum(x.T[:]) - T.sum(vis.T[:]))))      
    update.append((bhparam, bhparam + tr_rate*(T.sum(hid.T[:]) - T.sum(hid2.T[:]))))


train = theano.function(
          inputs=[x],
          outputs=[cost],
          updates=update)

What is the correct implementation of the update? My version is clearly not working. I tested it on the classic MNIST dataset.

This is the result I get for the cost. As you can see, it does not decrease. What is wrong?

Epoch: 0
cost =  [ 969572.73014003]
Epoch: 1
cost =  [ 258872.77507019]
Epoch: 2
cost =  [ 258872.77507019]
Epoch: 3
cost =  [ 258872.77507019]
Epoch: 4
cost =  [ 258872.77507019]
Epoch: 5
cost =  [ 2003326.79850769]
Epoch: 6
cost =  [ 258872.77507019]
...
Epoch: 99
cost =  [ 258872.77507019]

(Epochs 6 through 99 follow the same pattern: the cost stays at 258872.77507019, with occasional spikes to 2003326.79850769 every seven or eight epochs.)
  • Just a small suggestion; for the sake of clarity, you should note in your code, and in the question text, that `T` stands for Theano's tensor operations. A toy dataset with the same characteristics as your real data would also facilitate reproducibility. – wacax Jan 19 '16 at 23:19
  • 1
    I don't see any sampling in your code. E.g. for negative phase of hidden variables you should have something like `hid_means = T.nnet.sigmoid(T.dot(x, w) + bh); hid = sample(hid_means)`, where `sample(means) = int(rand() <= means)` for binary hidden variables. Also how do you conclude that it doesn't work? How many epochs did you use and how was `costs` changing with each epoch? – ffriend Jan 20 '16 at 11:38
  • I slightly changed my code. Is that what you mean by sampling? – user Jan 20 '16 at 13:58
  • I used 100 epochs. Cost initially decreases, then it stabilizes (goes up and down a little but it does not decrease) – user Jan 20 '16 at 13:59
  • You aren't updating the biases. – hbaderts Jan 20 '16 at 14:43
  • You are right; would that cause the problem I am seeing? I am going to try that and see if I get a better result. – user Jan 20 '16 at 15:30
  • 1
    Yes, that's what I meant by sampling. Note, that even though hidden variables are almost always binary, for visible variables you can (and in context of images even recommended) to use Gaussian or even use mean values directly (i.e. binary sampling for hiddens, no sampling at all for visible) . Anyway, more important is how you evaluate the result. For MNIST it's very convenient to visualize learned weights after each epoch, i.e. reshaping every each wait to original image's size and showing it. If you had some decrease in cost for some time, you could just learn good representation. – ffriend Jan 20 '16 at 22:56
  • In this case your weights may look something like [this](http://people.idsia.ch/~masci/software/filters_at_epoch_14.png) (maybe a little bit worse, but not just random noise). Also, it's important to choose a good number of hidden variables - for 20x20 MNIST images, a number between 40 and 100 should be suitable. – ffriend Jan 20 '16 at 23:00
  • I added updates for the biases. I still don't think I am getting correct results. I can post the whole code if anyone is interested, but all I am doing is loading the MNIST dataset, and that part is correct because I can visualise it. I posted the results so people can see what I mean. I understand @ffriend's point about not sampling the visibles, but this still looks suspicious even the way I do it. – user Jan 21 '16 at 08:08
  • Could you please post the whole code (say, to [pastebin](http://pastebin.com/))? Current code has a lot of missing details like number of hidden units, weight initialization, training rate and so on, and the issue may be in any of them. – ffriend Jan 21 '16 at 14:04
  • Actually, if I correct the code so that I take the average of the positive and negative associations (by dividing by training_examples) and use T.mean instead of T.sum for the biases, things work better, though the cost goes down very slowly (a sketch along these lines is included after these comments). – user Jan 21 '16 at 14:35
  • Ok, I pasted it at http://pastebin.com/g6AibNTY – user Jan 21 '16 at 14:42
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/34680/discussion-between-ffriend-and-user). – ffriend Jan 21 '16 at 21:56
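
For reference, below is a minimal sketch of a single CD-1 training step along the lines the comments converge on: the hidden states are sampled inside the Theano graph with RandomStreams, the visible reconstruction uses the mean values (no sampling), and the associations and bias updates are averaged over the batch rather than summed. The variable names mirror the question's code, but the sizes (feats, nhidden), the learning rate tr_rate, and the weight initialization are assumed values chosen for illustration, so treat this as a sketch of the update structure rather than a verified fix.

import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

# Assumed sizes and learning rate (not taken from the original post)
feats, nhidden = 784, 100
tr_rate = numpy.asarray(0.1, dtype=theano.config.floatX)

rng = numpy.random.RandomState(1234)
srng = RandomStreams(rng.randint(2 ** 30))

w = theano.shared(numpy.asarray(rng.normal(0, 0.01, (feats, nhidden)),
                                dtype=theano.config.floatX), name='w')
by = theano.shared(numpy.zeros(feats, dtype=theano.config.floatX), name='by')
bh = theano.shared(numpy.zeros(nhidden, dtype=theano.config.floatX), name='bh')

x = T.matrix('x')
batch_size = T.cast(x.shape[0], theano.config.floatX)

# Positive phase: sample binary hidden states from their probabilities
hid = T.nnet.sigmoid(T.dot(x, w) + bh)
hid_states = srng.binomial(size=hid.shape, n=1, p=hid,
                           dtype=theano.config.floatX)

# Negative phase: reconstruct the visibles (mean values, no sampling),
# then recompute the hidden probabilities
vis = T.nnet.sigmoid(T.dot(hid_states, w.T) + by)
hid2 = T.nnet.sigmoid(T.dot(vis, w) + bh)

# Associations averaged over the batch instead of summed
pos_associations = T.dot(x.T, hid) / batch_size
neg_associations = T.dot(vis.T, hid2) / batch_size

cost = T.sum((x - vis) ** 2)

updates = [
    (w, w + tr_rate * (pos_associations - neg_associations)),
    (by, by + tr_rate * T.mean(x - vis, axis=0)),
    (bh, bh + tr_rate * T.mean(hid - hid2, axis=0)),
]

train = theano.function(inputs=[x], outputs=cost, updates=updates)

Following the visualization suggestion in the comments, the learned filters can then be inspected by reshaping each column of w to the image size (28x28 is assumed here for standard MNIST) and plotting it:

import matplotlib.pyplot as plt

filters = w.get_value()            # shape (feats, nhidden)
fig, axes = plt.subplots(4, 5, figsize=(8, 7))
for i, ax in enumerate(axes.flat):
    ax.imshow(filters[:, i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()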
