Machine Learning or Survival Analysis?

Question

I am working on building a prediction model for disk failures (the time until a disk failure occurs and which parameters most strongly affect failures). I am a bit confused about the following:

Which data preprocessing steps should I perform? The dataset is highly imbalanced (500 failures and ~40000 non-failures).
Which types of machine learning models should I consider, given that the data is highly imbalanced?
A few days ago I read about survival analysis, and now I am unsure whether this is a survival analysis problem or a machine learning problem.

I am currently working with the dataset provided by BackBlaze (https://www.backblaze.com/b2/hard-drive-test-data.html).

It would be really great if I could get some direction :)

It's clearly a survival analysis problem because the data is time to failure, with (I guess) censoring when drives have run for some time without failure. ML is just another tool you could use for survival analysis. Your question should be "Machine Learning or Classical Maximum Likelihood or Bayesian methods for Survival Analysis?" Do a lit search and read stuff like: http://www.sciencedirect.com/science/article/pii/S0933365700000531 — Spacedman, Aug 23 '16 at 11:18
@Spacedman: (Usually I agree with you.) Actually that dataset does not appear to contain the time that drives were put into service for either the survivors or the decedents, so you cannot really calculate a time to failure. The data is organized into csv files that each contain one day's worth of status on drives identified by serial number and model number. One drive might appear on many successive days. Examine how the predictors evolve for single drives. The data contains SMART indicators of drive health, so the task would be to see whether you would get "warnings" of impending failure. — 42-, Jun 27 '22 at 23:25

score 3 · Answer 1 · answered Aug 23 '16 at 05:41

Some algorithms, such as SVM or logistic regression, let you assign a weight to each class, which compensates for the imbalance.
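As a rough sketch of what such class weights could look like, the snippet below computes the common "balanced" heuristic, n_samples / (n_classes * class_count), for the class counts given in the question. The helper name `balanced_class_weights` is made up for this illustration and is not part of any library.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get large weights, so each class contributes
    roughly equally to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Class counts taken from the question: 500 failures, ~40000 non-failures.
labels = [1] * 500 + [0] * 40000
weights = balanced_class_weights(labels)
print(weights)  # failures get weight 40.5, non-failures about 0.506
```

If you use scikit-learn, passing `class_weight="balanced"` to `LogisticRegression` or `SVC` applies this same heuristic, so you normally would not compute it by hand.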

This really sounds like a job for survival analysis, which is designed to answer questions like "When will machine X fail?" or "Which attribute influences failure the most?". You can start by plotting the Kaplan-Meier curve and then stratify it by some attribute. Then you can try a Cox regression model, which quantifies the influence of an attribute on survival as a hazard ratio. But don't forget to verify its assumptions (functional form and proportional hazards).
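To make the Kaplan-Meier step concrete, here is a minimal pure-Python sketch of the estimator. The toy durations and event flags are invented for illustration, and the function name `kaplan_meier` is hypothetical; in practice you would more likely use an existing survival package.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time with at least one failure.

    durations: observed time per drive (e.g. days under observation)
    events:    1 if the drive failed at that time, 0 if it was censored
    """
    deaths, removed = {}, {}
    for t, e in zip(durations, events):
        removed[t] = removed.get(t, 0) + 1      # everyone leaves the risk set at t
        if e:
            deaths[t] = deaths.get(t, 0) + 1    # only failures count as events

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d:
            # Drives censored at t are still counted as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= removed[t]
    return curve

# Toy data, invented for illustration only.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 0, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

To stratify by an attribute (for example drive model), you could call the function once per group and compare the resulting curves.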

Survival analysis is implemented very well in R, so don't be afraid. There is a simple and short tutorial which might help.

score 0 · Answer 2 · answered Jul 22 '16 at 00:11

You could think about rebalancing the dataset by undersampling the majority class, oversampling the minority class, or applying SMOTE.
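As an illustration of the simplest of these options, random undersampling of the majority class could be sketched as below. The function name, the `ratio` parameter, and the toy rows are assumptions made for this example; SMOTE and oversampling are not shown.

```python
import random

def undersample_majority(rows, label_of, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of majority rows.

    ratio is the desired majority-to-minority count after sampling.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)

    keep = min(len(majority), int(ratio * len(minority)))
    sampled = rng.sample(majority, keep)   # sampling without replacement

    out = minority + sampled
    rng.shuffle(out)                       # avoid all failures sitting first
    return out

# Toy rows mirroring the counts in the question: 500 failures, 40000 non-failures.
rows = [{"id": i, "failure": 1 if i < 500 else 0} for i in range(40500)]
balanced = undersample_majority(rows, lambda r: r["failure"], ratio=4.0)
print(len(balanced), sum(r["failure"] for r in balanced))
```

Resampling is usually applied to the training split only, so that the evaluation data keeps the original class ratio.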
Once the dataset is more balanced, logistic regression or random forests aren't a bad starting place. I believe random forests are better at handling unbalanced classification problems, but they might still have trouble with the degree of imbalance you're describing.
Survival analysis generally has an aspect of time to the problem (i.e. when might a hard drive fail?), so if time is included in your data then you definitely could frame it as a survival analysis problem. However, if time doesn't enter the problem, it might be easier to analyse it as an ML problem.
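One possible way to get a time dimension out of daily snapshot files is to collapse them into one record per drive, as sketched below. The column names `date`, `serial_number` and `failure` are assumptions about the CSV layout, and the toy rows are invented. Note that, as pointed out in the comments on the question, the install date may not be available, so the duration here is days under observation, not the drive's age.

```python
from datetime import date

def to_survival_records(daily_rows):
    """Collapse daily snapshots into {serial: (duration_days, event)}.

    event is 1 if any snapshot for the drive reports a failure, else 0
    (the drive is then treated as censored at its last snapshot).
    """
    first_seen, last_seen, failed = {}, {}, {}
    for row in daily_rows:
        sn = row["serial_number"]
        d = date.fromisoformat(row["date"])
        first_seen[sn] = min(first_seen.get(sn, d), d)
        last_seen[sn] = max(last_seen.get(sn, d), d)
        failed[sn] = failed.get(sn, False) or int(row["failure"]) == 1
    return {
        sn: ((last_seen[sn] - first_seen[sn]).days + 1, int(failed[sn]))
        for sn in first_seen
    }

# Toy snapshots, invented for illustration (values as strings, as csv.DictReader yields).
daily_rows = [
    {"date": "2016-01-01", "serial_number": "A", "failure": "0"},
    {"date": "2016-01-02", "serial_number": "A", "failure": "0"},
    {"date": "2016-01-03", "serial_number": "A", "failure": "1"},
    {"date": "2016-01-01", "serial_number": "B", "failure": "0"},
    {"date": "2016-01-02", "serial_number": "B", "failure": "0"},
]
records = to_survival_records(daily_rows)
print(records)
```

The resulting (duration, event) pairs are the usual input format for Kaplan-Meier or Cox-style models.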
