
Let's say we are predicting the sales of a shop and my training data has two sets of features:

  • One about the store sales with the dates (the field "Store" is not unique)
  • One about the store types (the field "Store" is unique here)

So the matrix would look something like this:

+-------+-----------+------------+---------+-----------+------+-------+--------------+
| Store | DayOfWeek |    Date    |  Sales  | Customers | Open | Promo | StateHoliday |
+-------+-----------+------------+---------+-----------+------+-------+--------------+
|   1   |     5     | 2015-07-31 |  5263.0 |   555.0   |  1   |   1   |      0       |
|   2   |     5     | 2015-07-31 |  6064.0 |   625.0   |  1   |   1   |      0       |
|   3   |     5     | 2015-07-31 |  8314.0 |   821.0   |  1   |   1   |      0       |
|   4   |     5     | 2015-07-31 | 13995.0 |   1498.0  |  1   |   1   |      0       |
|   5   |     5     | 2015-07-31 |  4822.0 |   559.0   |  1   |   1   |      0       |
|   6   |     5     | 2015-07-31 |  5651.0 |   589.0   |  1   |   1   |      0       |
|   7   |     5     | 2015-07-31 | 15344.0 |   1414.0  |  1   |   1   |      0       |
|   8   |     5     | 2015-07-31 |  8492.0 |   833.0   |  1   |   1   |      0       |
|   9   |     5     | 2015-07-31 |  8565.0 |   687.0   |  1   |   1   |      0       |
|   10  |     5     | 2015-07-31 |  7185.0 |   681.0   |  1   |   1   |      0       |
+-------+-----------+------------+---------+-----------+------+-------+--------------+
[986159 rows x 8 columns]

and

+-------+-----------+------------+---------------------+
| Store | StoreType | Assortment | CompetitionDistance |
+-------+-----------+------------+---------------------+
|   1   |     c     |     a      |         1270        |
|   2   |     a     |     a      |         570         |
|   3   |     a     |     a      |        14130        |
|   4   |     c     |     c      |         620         |
|   5   |     a     |     a      |        29910        |
|   6   |     a     |     a      |         310         |
|   7   |     a     |     c      |        24000        |
|   8   |     a     |     a      |         7520        |
|   9   |     a     |     c      |         2030        |
|   10  |     a     |     a      |         3160        |
+-------+-----------+------------+---------------------+
[1115 rows x 4 columns]

The second matrix describes each store's type, the assortment group of items it sells, and the distance to the nearest competitor store.

But in my test data, I only have the information in the first matrix, without the Customers and Sales fields. The aim is to predict the Sales field given:

  • Store
  • DayOfWeek
  • Date
  • Open (whether the store is open)
  • Promo (whether the store is having a promotion)
  • StateHoliday (whether it is a state holiday)

I can easily train a regression model on the bulleted fields above to predict Sales, but how can I make use of the second matrix in my training data, which I would not get in the test data?

Is it logical to assume that the second matrix about the Store types is static and I can easily join it to the test data?
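If that assumption holds, the join itself is a one-liner. Here is a minimal sketch, assuming pandas and two toy frames built from the first rows of the tables above (the variable names `store`, `test` and `test_joined` are mine, and the last test row is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the store-type table (first three rows from the question).
store = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
    "Assortment": ["a", "a", "a"],
    "CompetitionDistance": [1270, 570, 14130],
})

# Toy stand-in for the test data: no Sales or Customers columns.
# The last row (Store 1 on 2015-07-30) is made up to show a repeated store.
test = pd.DataFrame({
    "Store": [1, 2, 3, 1],
    "DayOfWeek": [5, 5, 5, 4],
    "Date": ["2015-07-31", "2015-07-31", "2015-07-31", "2015-07-30"],
    "Open": [1, 1, 1, 1],
    "Promo": [1, 1, 1, 1],
    "StateHoliday": [0, 0, 0, 0],
})

# A left join keeps every test row; validate="m:1" raises an error
# if "Store" turns out not to be unique in the store table.
test_joined = test.merge(store, on="Store", how="left", validate="m:1")
print(test_joined)
```

The same merge would be applied to the training rows, so that the model sees identical columns at training and prediction time.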

What happens if there are holes in my test data feature set? Let's say that for some rows in the test data, I don't have the "Promo" values.
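To make the "holes" case concrete, here is a rough sketch of one common way to handle it, assuming pandas: fill the gap with the most frequent value seen in training and keep a flag that records the value was missing. The frames, the `Promo_missing` column and the training values are hypothetical, and other imputation strategies may suit the data better.

```python
import numpy as np
import pandas as pd

# Hypothetical test rows where some "Promo" values are missing.
test = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo": [1.0, np.nan, 0.0, np.nan],
})

# Keep a flag so a model can learn that the value was missing.
test["Promo_missing"] = test["Promo"].isna().astype(int)

# Assumed fill value: the most frequent Promo value in (made-up) training data.
train_promo = pd.Series([1, 1, 0, 1, 0, 1])
fill_value = train_promo.mode().iloc[0]

# Replace the gaps and restore the integer dtype.
test["Promo"] = test["Promo"].fillna(fill_value).astype(int)
print(test)
```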

alvas
  • You know you can ask this on the Kaggle forum, and it is *already answered*: https://www.kaggle.com/c/rossmann-store-sales/forums/t/17137/how-to-handle-missing-customers-field-in-test-dataset/97276 and https://www.kaggle.com/c/rossmann-store-sales/forums/t/16730/two-questions-customers-and-scoring/95138 – Neil Slater Nov 18 '15 at 09:22
  • Oooo, pardon my kaggle noobiness. First time kaggle without anyone holding my hands =) – alvas Nov 18 '15 at 09:28
  • No problem. The `Customers` data is very specific to the competition. If you are not sure how to deal with missing values in general for ML (such as empty `Promo` values), it might be worth changing this question to be about that issue only. There are already some answers about that on this site, e.g. http://datascience.stackexchange.com/questions/8322/filling-missing-data-with-other-than-mean-values – Neil Slater Nov 18 '15 at 09:35

2 Answers


Use the extra features for unsupervised learning. You might enjoy Vladimir Vapnik's take on this in the context of SVMs, which he calls learning using privileged information (LUPI): "Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer".
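One possible reading of the unsupervised suggestion, sketched with pandas and NumPy: summarise the training-only `Customers` field per store, cluster the stores on that summary, and join the cluster label onto the test rows by `Store`. The tiny k-means loop, the choice of two clusters and the `TrafficCluster` name are my assumptions for illustration; this is not Vapnik's method.

```python
import numpy as np
import pandas as pd

# Training rows from the question; Customers is only available at training time.
train = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Customers": [555, 625, 821, 1498, 559, 589, 1414, 833, 687, 681],
})

# Summarise the training-only field per store.
per_store = train.groupby("Store", as_index=False)["Customers"].mean()
x = per_store["Customers"].to_numpy(dtype=float)

# Tiny 1-D k-means with k=2, initialised at the min and max so it is deterministic.
centers = np.array([x.min(), x.max()])
for _ in range(20):
    labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[labels == k].mean() for k in range(2)])
per_store["TrafficCluster"] = labels

# Test rows lack Customers but do have Store, so the label can be joined on.
test = pd.DataFrame({"Store": [4, 6, 7]})
test = test.merge(per_store[["Store", "TrafficCluster"]], on="Store", how="left")
print(test)
```

The cluster label then acts as an ordinary feature that exists in both training and test data, even though it was derived from a field that exists only in training.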

KT12
Emre

I think there might be a problem in the way you are stating the problem. You say that your test data doesn't have two fields, but that cannot be correct.

You have to take all your labelled data and split it into two groups, a training set and a test set, in a proportion such as 80%-20% or 70%-30%. Then you train your algorithm on the data in the training set and test the accuracy of the model on the data in the test set.

The accuracy you get on the test set is an estimate of how often your model is correct. Said another way, the next time you use your model to predict a sale, that accuracy is roughly the probability that your prediction is right.
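As a rough sketch of that split-and-evaluate procedure, assuming pandas and NumPy and using the ten Sales rows from the question: because Sales is a continuous value, the sketch reports an error measure (RMSE) in place of a classification accuracy, and the "model" is only a placeholder that predicts the training mean.

```python
import numpy as np
import pandas as pd

# Toy labelled data: the ten rows shown in the question.
data = pd.DataFrame({
    "Store": range(1, 11),
    "Sales": [5263.0, 6064.0, 8314.0, 13995.0, 4822.0,
              5651.0, 15344.0, 8492.0, 8565.0, 7185.0],
})

# 80%/20% random split; random_state fixes the shuffle for reproducibility.
train = data.sample(frac=0.8, random_state=0)
holdout = data.drop(train.index)

# Placeholder "model": always predict the mean Sales of the training rows.
prediction = train["Sales"].mean()

# Root mean squared error on the held-out rows.
rmse = np.sqrt(((holdout["Sales"] - prediction) ** 2).mean())
print(len(train), len(holdout), rmse)
```

For dated sales data, splitting by date (training on earlier days, evaluating on later ones) may give a more realistic estimate than a random split.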

hoaphumanoid