
I'm using Kaggle's Titanic dataset with pipelines, and I'm trying to prune my decision tree, for which I want the cost_complexity_pruning_path. The last line of the code below produces the error: ValueError: could not convert string to float: 'male'. Do you know what I'm doing wrong? I have looked at Sklearn: applying cost complexity pruning along with pipeline, but that doesn't seem to help in my case.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

cat_vars = ['Sex', 'Embarked']
num_vars = ['Age']

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('std_scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder())])

col_trans = ColumnTransformer([('numerical', num_pipe, num_vars), ('categorical', cat_pipe, cat_vars)], remainder='passthrough')

final_pipe = Pipeline([('column_trans', col_trans), ('tree', DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)

# Raises ValueError: could not convert string to float: 'male'
path = final_pipe.steps[1][1].cost_complexity_pruning_path(X_train, y_train)
user5744148

1 Answer


Because cost_complexity_pruning_path refits the tree model on the data you provide before computing the pruning path (source), you need to preprocess the data yourself first, just as the pipeline would. So this should do it:

# Apply every step except the final tree, then ask the tree for its path
X_preproc = final_pipe[:-1].transform(X_train)
path = final_pipe.steps[-1][1].cost_complexity_pruning_path(X_preproc, y_train)
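To make the fix concrete, here is a minimal, self-contained sketch. The asker's Titanic data isn't available here, so a small synthetic DataFrame with the same Sex/Embarked/Age columns stands in for X_train; everything else follows the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the Titanic training data (not the asker's real data)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=100),
    'Embarked': rng.choice(['S', 'C', 'Q'], size=100),
    'Age': rng.uniform(1, 80, size=100),
})
y_train = rng.integers(0, 2, size=100)

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                     ('std_scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('ohe', OneHotEncoder())])
col_trans = ColumnTransformer([('numerical', num_pipe, ['Age']),
                               ('categorical', cat_pipe, ['Sex', 'Embarked'])],
                              remainder='passthrough')
final_pipe = Pipeline([('column_trans', col_trans),
                       ('tree', DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)

# Preprocess first (all steps except the tree), then compute the path
X_preproc = final_pipe[:-1].transform(X_train)
path = final_pipe.steps[-1][1].cost_complexity_pruning_path(X_preproc, y_train)
```

The resulting path.ccp_alphas can then be fed back through the pipeline, e.g. by setting tree__ccp_alpha in a grid search over the whole pipeline.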
Ben Reiniger
  • ...which I now see is basically the same answer as I provided in the linked question. Oh well. – Ben Reiniger Feb 23 '21 at 16:55
  • I tried your answer in the other question, but it didn't work; using your answer here, it works fine. I guess the problem was that here I had more than one transformer before the tree, which meant that I needed final_pipe[:-1] instead of the final_pipe[-1] that I tried based on the question I linked to, which you previously answered – user5744148 Feb 24 '21 at 12:28
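The distinction raised in that last comment comes down to sklearn's Pipeline indexing (supported since scikit-learn 0.21): a slice like pipe[:-1] returns a sub-Pipeline of all but the last step, while pipe[-1] returns the final estimator itself. A quick sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('tree', DecisionTreeClassifier())])

preproc = pipe[:-1]  # a Pipeline containing only the two preprocessing steps
tree = pipe[-1]      # the DecisionTreeClassifier itself, no preprocessing
```

So with more than one transformer before the tree, only pipe[:-1] gives you something you can call .transform() on to reproduce the full preprocessing.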