I'm studying the sklearn DecisionTreeClassifier and I'm having some trouble understanding the concept of pruning. From what I understand, it consists of making the tree shallower in order to avoid overfitting, which can also be achieved by setting a max depth for the tree. How is using minimum cost complexity pruning to find the subtree with the best accuracy different from running a grid search to test which tree depth is best? Would it even make sense to run a grid search that tries every possible max_depth value and also tests different ccp_alpha values, or would that be redundant? A code sample of the possibly redundant grid search I'm suggesting is below (and a sketch of the pruning-only search I have in mind follows the output):
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Fit an unconstrained tree to get the pruning path and the maximum depth
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

ccp_alphas = clf.cost_complexity_pruning_path(X_train, y_train)["ccp_alphas"]
max_depth = clf.get_depth()

# Search over every possible depth and every alpha from the pruning path
param_grid = {"max_depth": list(range(1, max_depth + 1)),
              "ccp_alpha": list(ccp_alphas)}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           scoring="accuracy", cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy * 100))
print("Best Parameters:", best_parameters)
Output:
Best Accuracy: 82.12 %
Best Parameters: {'ccp_alpha': 0.0, 'max_depth': 6}
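For reference, this is roughly what I mean by using pruning on its own: a rough sketch that assumes the same clf, X_train and y_train as above, cross-validating one pruned tree per alpha from the pruning path and keeping the best-scoring one.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Candidate alphas come from the pruning path of the unconstrained tree (assumed fitted above)
ccp_alphas = clf.cost_complexity_pruning_path(X_train, y_train)["ccp_alphas"]

# Cross-validate one pruned tree per alpha and keep the best-scoring one
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                          X_train, y_train, cv=10, scoring="accuracy").mean()
          for alpha in ccp_alphas]
best_alpha = ccp_alphas[scores.index(max(scores))]
print("Best ccp_alpha:", best_alpha)
print("Best CV accuracy: {:.2f} %".format(max(scores) * 100))

(I realise this reuses the training data both to compute the pruning path and for cross-validation; it's only meant to illustrate the pruning-only alternative I'm comparing against.)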