I am working on building a prediction model for disk failures (how long it takes for a disk to fail, and which parameters most strongly affect failure). I am a bit confused about the following:
- Which data preprocessing steps should I perform? The dataset is highly imbalanced (500 failures and ~40,000 non-failures).
- Which types of machine learning models should I consider, given that the data is highly imbalanced?
- A few days ago I read about survival analysis, and now I am unsure whether this should be treated as a survival analysis problem or as a machine learning problem.
I am currently working with the dataset provided by Backblaze (https://www.backblaze.com/b2/hard-drive-test-data.html).
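To make the survival analysis framing concrete, here is a minimal sketch of how I imagine collapsing the daily snapshot rows into one row per drive, with a duration and an event flag. It assumes the columns `date`, `serial_number`, and `failure` (1 on the day a drive fails, 0 otherwise); the toy rows and the helper name `to_survival_table` are made up for illustration, and the duration is counted from the first day a drive appears in the data, not from its true start of service.

```python
import pandas as pd

# Toy stand-in for the daily snapshot files (values are invented).
daily = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03",                # drive A
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",  # drive B
    ]),
    "serial_number": ["A", "A", "A", "B", "B", "B", "B"],
    "failure": [0, 0, 1, 0, 0, 0, 0],
})

def to_survival_table(daily: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per drive with duration and event flag."""
    table = daily.groupby("serial_number").agg(
        first_seen=("date", "min"),
        last_seen=("date", "max"),
        event=("failure", "max"),  # 1 if the drive ever failed, else 0 (censored)
    )
    # Number of observed days, counting both the first and the last day.
    table["duration_days"] = (table["last_seen"] - table["first_seen"]).dt.days + 1
    return table[["duration_days", "event"]]

survival = to_survival_table(daily)
print(survival)
```

In this form, drives that never failed are kept as censored observations (`event == 0`) instead of being treated as plain negative examples, which is how I understand survival analysis to differ from ordinary classification on the imbalanced labels.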
It would be really great if I could get some direction :)