
I am trying to build machine learning models (GBM, RF, stacking) on top of a dataset that is about 3 GB in size on my local computer. However, I only have 4 GB of memory (of which only 2 GB are available).

My question is: is it reasonable to split the data into 20% for the training set, 10% for the validation set, and the remaining 70% for testing? I also split the test set into 7 equal subsets with the same class distribution, because I cannot evaluate a model on the full dataset at once.
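For concreteness, here is a minimal sketch of the split I have in mind, assuming (hypothetically) that the data is a CSV file named data.csv with a label column called "target" and that it can be loaded with pandas:

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Hypothetical file and column names, just to illustrate the split.
    df = pd.read_csv("data.csv")

    # 20% train, 10% validation, 70% test, stratified on the label.
    train, rest = train_test_split(df, train_size=0.20, stratify=df["target"], random_state=0)
    # 0.125 of the remaining 80% equals 10% of the full dataset.
    val, test = train_test_split(rest, train_size=0.125, stratify=rest["target"], random_state=0)

    # Split the 70% test part into 7 roughly equal, stratified chunks,
    # so each chunk can be evaluated separately within the memory limit.
    skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
    test_chunks = [test.iloc[idx] for _, idx in skf.split(test, test["target"])]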

Still, I am not really convinced by this approach, and I am not sure it is good enough to produce a robust final model. What can I do? I am new to machine learning and big data.

  • What are the largest data sizes you have tested with your current laptop? I think you can do some things with the configuration you have, but you need to make sure you don't exceed certain limits. Have you tried Spark? – Wajdi Apr 22 '18 at 12:28
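In case it helps clarify what the commenter is suggesting, a minimal PySpark sketch might look like the following (assuming, hypothetically, the data is a CSV file named data.csv):

    from pyspark.sql import SparkSession

    # Start a local Spark session; Spark spills to disk, so the 3 GB file
    # does not have to fit entirely in memory.
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical file name; inferSchema triggers an extra pass over the data.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    print(df.count())  # check how many rows we are dealing with

    # Random 20/10/70 split handled by Spark rather than in local memory.
    train, val, test = df.randomSplit([0.2, 0.1, 0.7], seed=0)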

0 Answers