If I train my model using the following code:

import xgboost as xg

params = {'max_depth': 3,
          'min_child_weight': 10,
          'learning_rate': 0.3,
          'subsample': 0.5,
          'colsample_bytree': 0.6,
          'objective': 'reg:linear',
          'n_estimators': 1000,
          'eta': 0.3}

features = df[feature_columns]
target = df[target_columns]
dmatrix = xg.DMatrix(features.values,
                     target.values,
                     feature_names=features.columns.values)
clf = xg.train(params, dmatrix)

it finishes in about 1 minute.

If I train my model using the scikit-learn wrapper:

import xgboost as xg
max_depth = 3
min_child_weight = 10
subsample = 0.5
colsample_bytree = 0.6
objective = 'reg:linear'
num_estimators = 1000
learning_rate = 0.3

features = df[feature_columns]
target = df[target_columns]
clf = xg.XGBRegressor(max_depth=max_depth,
                      min_child_weight=min_child_weight,
                      subsample=subsample,
                      colsample_bytree=colsample_bytree,
                      objective=objective,
                      n_estimators=num_estimators,
                      learning_rate=learning_rate)
clf.fit(features, target)

it takes over 30 minutes.

I would have thought the underlying code is nearly identical (i.e. XGBRegressor calls xg.train under the hood) - what's going on here?

user1566200

1 Answer


xgboost.train ignores the parameter n_estimators, while xgboost.XGBRegressor accepts it. In xgboost.train, the number of boosting iterations (i.e. n_estimators) is controlled by num_boost_round, which defaults to 10.

In your case, the first snippet runs only 10 boosting iterations (the default), while the second runs 1000. The timing gap should largely disappear if you change clf = xg.train(params, dmatrix) to clf = xg.train(params, dmatrix, 1000).
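As a minimal sketch (reusing the params and dmatrix from the question), passing num_boost_round by keyword makes the intent explicit:

import xgboost as xg

# num_boost_round is the native-API counterpart of the sklearn
# wrapper's n_estimators; setting it to 1000 makes xg.train run
# the same number of rounds as XGBRegressor(n_estimators=1000).
clf = xg.train(params, dmatrix, num_boost_round=1000)

# Sanity check: for a single-output regression objective there is
# one tree per boosting round, so the Booster dump should contain
# about 1000 trees instead of the default 10.
print(len(clf.get_dump()))

With both configured for 1000 rounds, the two APIs should take roughly the same time, since the scikit-learn wrapper ultimately calls xgboost.train internally.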

References

http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor

Icyblade