
I apologise if this is a bit long-winded, but another user suggested that I post it here.

I will start by saying that I am very new to the world of machine learning and deep learning. As such, the most important thing I am after is an understanding of what I am doing.

I am trying to build an ANN for binary classification.

I have a binary feature matrix of the form N x D, where N is the number of samples and D is the number of features. In my dataset N has a max of ~2 million, but I run my testing on ~500k due to the time it takes to run (even when utilising my GPU). If I get a promising validation curve after testing, I will run on the full dataset to verify. D in my dataset is 5. Thus, I have a feature matrix of the form 500000 x 5. A snippet is below:

[[0 1 0 1 1]
 [0 1 1 0 1]
 [0 0 1 0 1]
 [1 1 0 1 1]
 [1 1 0 0 0]
 [0 1 1 0 1]
 [0 1 0 0 0]
 [1 1 0 1 1]
 [0 0 0 1 1]
 [1 0 0 0 1]]

I have a target matrix in binary form, snippet below:

[1 0 1 1 0 0 1 0 1 1]
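For reference, a synthetic stand-in with the same shapes can be generated with numpy; the random values below are purely illustrative and carry no real relationship between features and targets:

import numpy as np

rng = np.random.default_rng(42)

# illustrative stand-in: 500,000 samples, 5 binary features, binary targets
X = rng.integers(0, 2, size=(500_000, 5))
Y = rng.integers(0, 2, size=500_000)

print(X.shape)  # (500000, 5)
print(Y.shape)  # (500000,)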

Based on my understanding, for binary classification the input layer should have the same number of nodes as D, and the output layer should have 1 node with a sigmoid activation function.

Thus, I have taken this approach. Now, I also understand that machine/deep learning involves a lot of experimentation, so I have gone through many different iterations of changing the number of hidden layers, as well as the number of nodes per hidden layer, all to no obvious benefit. I have also played around with the following: the learning rate (for the Adam optimiser), the train-to-test ratio (currently 0.33), the random state used when splitting the dataset (currently 42), the batch size (currently 128), and the number of epochs (currently 50). All of this experimentation still leads to a high validation loss, as I show further below.
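As a rough sketch of how this experimentation can be scripted rather than done by hand (the build_model helper and the candidate values below are illustrative, and it assumes the x_train/x_test split shown further down):

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def build_model(n_features, hidden_sizes, learning_rate):
    # small feed-forward binary classifier: n_features inputs, 1 sigmoid output
    i = Input(shape=(n_features,))
    x = i
    for size in hidden_sizes:
        x = Dense(size, activation="relu")(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(i, x)
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# loop over a few candidate architectures and learning rates (illustrative values)
for hidden_sizes in [(8, 16, 32), (32, 64, 32), (64, 64, 64, 64, 64)]:
    for lr in [1e-2, 1e-3, 1e-4]:
        m = build_model(5, hidden_sizes, lr)
        h = m.fit(x_train, y_train, validation_data=(x_test, y_test),
                  epochs=10, batch_size=128, verbose=0)
        print(hidden_sizes, lr, min(h.history["val_loss"]))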

Now, for my code. Below is my code to split the data into train and test sets.

from sklearn.model_selection import train_test_split

train_to_test_ratio = 0.33  # fraction of the data held out for testing
random_state_ = 42

# split the data into train and test sets
# this lets us simulate how our model will perform on unseen data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=train_to_test_ratio, random_state=random_state_)
N, D = x_train.shape

Below is my code to build the model. You can see I currently have 3 hidden layers. I have played around with this a lot, from 1 hidden layer to 10 with varying nodes per layer, and the best I have found is the 3 hidden layers below. The learning rate is currently 0.001, which creates the best loss curve; anything bigger leaves the loss too high.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

learning_rate = 0.001

# build the model
i = Input(shape=(D,))  # input layer size equals the number of features D
x = Dense(8, activation="relu")(i)
x = Dense(16, activation="relu")(x)
x = Dense(32, activation="relu")(x)
#x = Dense(32, activation="relu")(x)
#x = Dense(64, activation="relu")(x)
#x = Dense(128, activation="relu")(x)
#x = Dense(64, activation="relu")(x)
#x = Dense(32, activation="relu")(x)
#x = Dense(16, activation="relu")(x)
#x = Dense(4, activation="relu")(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(i, x)
model.compile(optimizer=Adam(learning_rate=learning_rate), loss="binary_crossentropy", metrics=["accuracy"])
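As a quick sanity check, model.summary() prints the layer sizes and parameter counts, which makes it easier to compare architectures between experiments:

# inspect the architecture and parameter counts before training
model.summary()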

I then train the model:

# train the model
# fits on the training split and evaluates on the held-out test split after each epoch
b_size = 128
iterations = 50
r = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=iterations, batch_size=b_size)
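The curves below come from the fit history; a minimal matplotlib sketch of how they can be plotted from r.history looks like this:

import matplotlib.pyplot as plt

# loss curves from the training history
plt.plot(r.history["loss"], label="train loss")
plt.plot(r.history["val_loss"], label="validation loss")
plt.legend()
plt.show()

# accuracy curves (older standalone Keras uses the keys "acc"/"val_acc" instead)
plt.plot(r.history["accuracy"], label="train accuracy")
plt.plot(r.history["val_accuracy"], label="validation accuracy")
plt.legend()
plt.show()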

Here are the validation curves:

Validation curves

Here are the accuracy curves:

Accuracy curves

As you can see, the validation loss is very high and the accuracy is not very high, which tells me the model is failing.

What do these outputs mean? Despite all the experimentation, I still cannot get the validation loss low together with a high accuracy, which suggests to me that the model is wrong. But what would the correct model be? Is there any advice that can be given to help me understand how to move forward and build a better model?

Dean
  • Hello Dean, I confirm that your validation curves shouldn't be as bad as they are. I wouldn't trust this value: `train_to_test_ratio = 0.33`; we usually split our database into 80-95% for training and 20-5% for testing, so I'd rather pick 0.85. This may be a reason for your results. Another thing I wonder is whether using an ANN is not overkill here; your problem is quite simple, so it may be interesting to use a KNN algorithm first (a minimal sketch follows after these comments) and see what accuracy you reach with it, to get an idea of how hard your task is (an ANN should give similar or better accuracy than a KNN). – Ubikuity May 07 '21 at 09:35
  • Is that the real dataset? It has five features, all binary, so you can have a max of 32 distinct instances. So, how are the 100K different to each other? – 10xAI May 08 '21 at 05:16
  • @Ubikuity thanks. I changed the test size as you mentioned, but it did not seem to change anything. I have also implemented the KNN algorithm as you suggested, but the precision is only .69 (weighted avg), recall is .69 and f1-score is .69. Thoughts? – Dean May 08 '21 at 08:31
  • @10xAI the real dataset is 5 features spanning a range of -30 to +30. I have classified negative values as 0 and positive values as 1 to try and simplify the model. Also worth mentioning is that I have tried building the ANN using the real numbers and scaling them using StandardScaler, however, again, the validation and accuracy curves are similar to the above. – Dean May 08 '21 at 08:34
  • 10xAI made a very good point; it could be interesting to normalize your data instead of setting it to binary, as it seems two identical binary inputs can give different outputs, which is why you do not have great accuracy with KNN or your ANN (I remind you that random classification with 2 classes gives 50% accuracy). Could you try normalizing instead of setting to binary and tell us what your results look like? – Ubikuity May 08 '21 at 12:14
  • If that is correct, then please have a bigger model. 200 Neurons, 5-6 layers. The test set should be 25%-30% since you have a decent dataset size. Keep LR to default. Increase batch size a bit as it's still fluctuating. Data should be standardized. – 10xAI May 08 '21 at 17:03
  • Thanks @Ubikuity, I have adjusted the KNN model - it previously scored .69, but that was on a small section of the dataset with the default K value. I then tested to find the optimal K value (35) and it is returning 100% accuracy, which doesn't seem right. Should I go back to the ANN? I also standardised the data in the KNN, instead of having it binary. – Dean May 09 '21 at 09:35
  • @10xAI do you mean 5-6 layers of 200 neurons each? Or do I increment/decrement the deeper it goes? – Dean May 09 '21 at 09:42
  • @10xAI I made the ANN model 6 layers deep with a total of ~200 nodes (increasing with depth) and the model is now returning very high accuracy with very low loss. My challenge now, I guess, is to understand whether the model is overfitting! Thanks for the help. – Dean May 10 '21 at 15:18
  • It should overfit. You can check that with the train/test accuracy plot. Try adding regularization, i.e. Dropout, and bring the overfitting down. You can also trim the model. The 200/6 suggestion was just a gut feeling. – 10xAI May 10 '21 at 15:37
  • Thanks @10xAI, can you make your suggestion an answer so I can mark off this post? – Dean May 10 '21 at 17:13
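For reference, a minimal sketch of the KNN baseline Ubikuity suggests, with the standardization 10xAI recommends (the n_neighbors=35 value comes from the comments above; the rest is an illustrative assumption using scikit-learn):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# standardize the real-valued features rather than binarizing them
scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)
x_test_std = scaler.transform(x_test)

# simple KNN baseline to gauge how hard the task is
knn = KNeighborsClassifier(n_neighbors=35)
knn.fit(x_train_std, y_train)
print(classification_report(y_test, knn.predict(x_test_std)))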

1 Answer


500K is a decent data size, so we can have a larger test set, and the network should also be bigger.

Try the suggestions listed below:

  • Have a bigger model e.g. 200 Neurons, 5-6 layers
  • The test set should be 25%-30% since you have a decent dataset size
  • Keep LR to default
  • Increase batch size a bit as it's still fluctuating
  • Data should be standardized
  • Try adding Regularization when it starts overfitting
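A minimal sketch of what these suggestions could look like in Keras (the specific layer widths, dropout rate, and batch size below are illustrative assumptions, not part of the answer; it assumes the original X and Y with real-valued features):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

# 75/25 split and standardized (real-valued) features
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# bigger network: ~200 neurons over 5 hidden layers, with dropout as regularization
i = Input(shape=(x_train.shape[1],))
x = Dense(16, activation="relu")(i)
x = Dropout(0.2)(x)
x = Dense(32, activation="relu")(x)
x = Dropout(0.2)(x)
x = Dense(64, activation="relu")(x)
x = Dropout(0.2)(x)
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(i, x)

# default Adam learning rate, larger batch size to smooth the fluctuating curves
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
r = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=50, batch_size=512)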
10xAI