
I am creating a model using xgboost. Regarding its parameters, its objective is survival:cox and its eval_metric is cox-nloglik. The output Y ranges from -800 to 800. However, the predicted values are way too large (in the range of 10^3 to 10^13). Why is this? What is the output of a Cox regression in xgboost?

Kush Patel

1 Answer


In the documentation, you can find that the predictions are returned on the hazard ratio scale:

survival:cox Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).

In other words, in a Cox proportional hazards model we have:

$h(t) = h_0(t) \times X$

where $X$ in a traditional linear model is of the form $\exp(b_1x_1 + b_2x_2 + \dots + b_px_p)$, and $h_0(t)$ is the baseline hazard function. In the xgboost case with tree base learners, $X$ is the exponentiated prediction produced by the weighted sum of the individual trees, which are trained on the negative gradient of a loss function designed specifically for survival data.

Put differently, the predicted values are not failure times; they are relative risks, so values far outside the range of Y are expected.

I find the xgboost documentation on the survival:cox setting extremely sparse. To my knowledge, there is no internal method to derive $h_0(t)$, for example.
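If you do need a baseline hazard, one common workaround is to estimate it yourself from the fitted hazard ratios with the Breslow estimator. Below is a hedged sketch (not an xgboost API; the function name and the no-ties simplification are my own):

```python
import numpy as np

def breslow_baseline(times, events, hazard_ratios):
    """Cumulative baseline hazard H0(t) via the Breslow estimator.

    times: observed times (all positive)
    events: 1 = event observed, 0 = right censored
    hazard_ratios: exp(margin) for each subject from the fitted model

    Simplifying assumption: no tied event times.
    """
    order = np.argsort(times)
    t, e, hr = times[order], events[order], hazard_ratios[order]
    # Denominator at each time: sum of hazard ratios of subjects still at risk
    at_risk = np.cumsum(hr[::-1])[::-1]
    event_times = t[e == 1]
    # Breslow increment at each event time: 1 / (sum of HR over the risk set)
    increments = 1.0 / at_risk[e == 1]
    return event_times, np.cumsum(increments)
```

With $H_0(t)$ in hand, a subject's cumulative hazard is $H_0(t) \times \mathrm{HR}$, and the survival function follows as $\exp(-H_0(t) \times \mathrm{HR})$.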

aranglol