
I've trained an AWS Machine Learning model with the training data from here: https://www.kaggle.com/c/titanic/data

I'm now trying to run a batch prediction with the test data from the same source, but I get the following error when I try to load the data: "The schema in this data file must match the datasource used to create the ML model ml-xxxxxxxxx. Ensure that the data file you are using matches the schema structure."

The schema, as far as I can see, is identical. I have tried it with and without the 'survived' column, which is the value I'm trying to predict. I even tried it with the training set itself, which obviously has an identical schema, and got the same error.

What am I doing wrong?

Monty

2 Answers


I ran into the same problem today, searched for other people having the same issue, and found your question.

I solved my problem by creating the data source first and then running the prediction from there. So, instead of selecting the following option,

Batch Predictions > Create new batch prediction > ML model for batch prediction > My data is in S3, and I need to create a datasource

which fails, I first did:

Datasources > Create a new datasource...

Next, I ran the batch prediction from an existing datasource successfully.

Stereo
max

One common reason the schemas do not match is letting the AML service infer the attribute types. I just found this to be the root cause with my two data sets: in my test file, several attributes were inferred as Numeric or Binary when they were actually the other type. Be sure to check the inferred schema of each subsequent (test, eval, etc.) data source against the schema from your training data set.
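If you have both schemas as JSON, the comparison can be done mechanically. The following is only a rough sketch: it assumes the schemas use the `attributes` / `attributeName` / `attributeType` / `targetAttributeName` layout of an AML schema file, and the function names and the sample attribute types are made up for illustration, not taken from the actual Titanic data sources.

```python
def attribute_types(schema):
    """Map each attribute name to its type for one schema document."""
    return {a["attributeName"]: a["attributeType"] for a in schema["attributes"]}


def schema_mismatches(training_schema, other_schema):
    """List the differences between a training schema and another schema.

    The target attribute is allowed to be absent from the other schema,
    since a test file normally does not contain the value being predicted.
    """
    target = training_schema.get("targetAttributeName")
    expected = attribute_types(training_schema)
    actual = attribute_types(other_schema)

    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            if name != target:
                problems.append(f"{name}: missing")
        elif actual[name] != expected_type:
            problems.append(
                f"{name}: expected {expected_type}, inferred {actual[name]}"
            )
    for name in actual:
        if name not in expected:
            problems.append(f"{name}: not in training schema")
    return problems


# Hypothetical schemas: the attribute types here are illustrative only.
training = {
    "targetAttributeName": "Survived",
    "attributes": [
        {"attributeName": "Survived", "attributeType": "BINARY"},
        {"attributeName": "Pclass", "attributeType": "CATEGORICAL"},
        {"attributeName": "Age", "attributeType": "NUMERIC"},
    ],
}
test = {
    "attributes": [
        {"attributeName": "Pclass", "attributeType": "NUMERIC"},
        {"attributeName": "Age", "attributeType": "NUMERIC"},
    ],
}

for problem in schema_mismatches(training, test):
    print(problem)
```

Any attribute it reports is one whose type you would want to correct by hand in the new data source's schema before running the batch prediction.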