
I am trying to build machine learning models (GBM, RF, stacking) on top of a dataset that is about 3 GB in size on my local computer. However, I only have 4 GB of memory (of which only 2 GB are available).

My question is: is it reasonable to split the data into 20% for the training set, 10% for the validation set, and the remaining 70% for testing? I also split the test set into 7 equal subsets with the same class distribution, because I cannot evaluate a model on the full dataset at once.
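For concreteness, here is a minimal sketch of the split I have in mind, assuming (hypothetically) that the data is a CSV file named data.csv with a label column called "target" and that it can be loaded with pandas:

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Hypothetical file and column names, just to illustrate the split.
    df = pd.read_csv("data.csv")

    # 20% train, 10% validation, 70% test, stratified on the label.
    train, rest = train_test_split(df, train_size=0.20, stratify=df["target"], random_state=0)
    # 0.125 of the remaining 80% equals 10% of the full dataset.
    val, test = train_test_split(rest, train_size=0.125, stratify=rest["target"], random_state=0)

    # Split the 70% test part into 7 roughly equal, stratified chunks,
    # so each chunk can be evaluated separately within the memory limit.
    skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
    test_chunks = [test.iloc[idx] for _, idx in skf.split(test, test["target"])]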

Still, I am not really convinced by this approach, and I am not sure it is good enough to produce a robust final model. What can I do? I am new to machine learning and big data.

  • What are the largest data sizes you have tested with your current laptop? I think you can do some things with the configuration you have, but you need to make sure you don't exceed certain limits. Have you tried Spark? – Wajdi Apr 22 '18 at 12:28
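In case it helps clarify what the commenter is suggesting, a minimal PySpark sketch might look like the following (assuming, hypothetically, the data is a CSV file named data.csv):

    from pyspark.sql import SparkSession

    # Start a local Spark session; Spark spills to disk, so the 3 GB file
    # does not have to fit entirely in memory.
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical file name; inferSchema triggers an extra pass over the data.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    print(df.count())  # check how many rows we are dealing with

    # Random 20/10/70 split handled by Spark rather than in local memory.
    train, val, test = df.randomSplit([0.2, 0.1, 0.7], seed=0)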

0 Answers