
I am a newbie to machine learning and Keras, and I am now working on a multi-class image classification problem using Keras. The input is tagged images. After some pre-processing, the training data is represented as a Python list:

[["dog", "path/to/dog/imageX.jpg"],["cat", "path/to/cat/imageX.jpg"], 
 ["bird", "path/to/cat/imageX.jpg"]]

the "dog", "cat", and "bird" are the class labels. I think one-hot encoding should be used for this problem but I am not very clear on how to deal it with these string labels. I've tried sklearn's LabelEncoder() in this way:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
transformed_label = encoder.fit_transform(["dog", "cat", "bird"])
print(transformed_label)

And the output is [2 1 0], which is different from my expected output of something like [[1,0,0],[0,1,0],[0,0,1]]. It can be done with some coding, but I'd like to know if there is a "standard" or "traditional" way to deal with it?

Dracarys

3 Answers


Sklearn's LabelEncoder finds all classes and assigns each a numeric id starting from 0. This means that whatever your class representations are in the original data set, you now have a simple, consistent way to represent each. It doesn't do one-hot encoding itself, but as you correctly identify it is pretty close, and you can use those ids to quickly generate one-hot encodings in other code.
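
For example, here is a minimal sketch (on the assumption that you are happy to pull in Keras's to_categorical utility) that turns those ids into one-hot rows:

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

encoder = LabelEncoder()
ids = encoder.fit_transform(["dog", "cat", "bird"])  # array([2, 1, 0])

# One row per label, one column per class
one_hot = to_categorical(ids, num_classes=len(encoder.classes_))
print(one_hot)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]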

If you want one-hot encoding, you can use LabelBinarizer instead. This works very similarly:

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
transformed_label = encoder.fit_transform(["dog", "cat", "bird"])
print(transformed_label)

Output:

[[0 0 1]
 [0 1 0]
 [1 0 0]]
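
LabelBinarizer can also map model outputs back to the original string labels via inverse_transform; for multiclass targets it takes the argmax of each row, so it works on rows of probabilities too. A minimal sketch, reusing the fitted encoder from above (the prediction values are made up for illustration):

import numpy as np

# Hypothetical rows of class scores; columns follow encoder.classes_ order
# (['bird', 'cat', 'dog'] here, since classes are sorted alphabetically)
predictions = np.array([[0.1, 0.1, 0.8],
                        [0.7, 0.2, 0.1]])
print(encoder.inverse_transform(predictions))  # ['dog' 'bird']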
Neil Slater
  • But how could one-hot encoding help you when you try to predict a new color? Maybe in your case you have to retrain the model. Do you have any solution? – gtzinos Dec 27 '17 at 12:25
  • @gtzinos: That looks like a different question. Perhaps ask it on the site. If you do, make clear whether you are concerned about NN predicting a brand new item (not seen in training data, but logically should happen on new inputs), or adding new classes on the fly when they are encountered in online training data. – Neil Slater Dec 28 '17 at 18:32
  • Pay attention that `LabelBinarizer()` breaks when there are only 2 categories: see https://stackoverflow.com/questions/31947140/sklearn-labelbinarizer-returns-vector-when-there-are-2-classes and https://stackoverflow.com/questions/48074462/scikit-learn-onehotencoder-fit-and-transform-error-valueerror-x-has-different. – gented Jan 20 '20 at 10:28

With the ImageDataGenerator feature in Keras, we can get one-hot encoded labels directly: flow_from_directory infers the classes from the sub-directory names, and its default class_mode='categorical' yields one-hot label vectors. Sample code:

import tensorflow as tf

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    validation_split=0.2)
# Note: featurewise_center/featurewise_std_normalization need
# datagen.fit(...) on sample data before the statistics take effect.

img_size = 128
# Classes are inferred from the sub-directory names under 'train';
# the default class_mode='categorical' yields one-hot encoded labels.
train_generator = datagen.flow_from_directory('train',
                                              target_size=(img_size, img_size),
                                              subset='training',
                                              batch_size=32)
X, y = next(train_generator)

print('Input features shape', X.shape)  # (32, img_size, img_size, 3)
print('Actual labels shape', y.shape)   # (32, number_of_classes)

The other advantage of using this is that when we make a prediction on a new file, we can use train_generator.class_indices to map the prediction back to the actual string label.
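
For example, a minimal sketch (here `model` is assumed to be a classifier already trained on this generator):

# class_indices maps class name -> column index, e.g. {'bird': 0, 'cat': 1, 'dog': 2}
idx_to_class = {v: k for k, v in train_generator.class_indices.items()}

preds = model.predict(X)  # one row of class probabilities per image
predicted_labels = [idx_to_class[i] for i in preds.argmax(axis=1)]
print(predicted_labels)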

Vivek

Alternatively, you can use sparse_categorical_crossentropy as the loss function; then you don't need one-hot encoding at all, because this loss accepts integer class labels directly. Sample code:

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
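
A minimal sketch of how this fits together (the tiny model and random data below are made-up placeholders, not a recommended architecture):

import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

labels = ["dog", "cat", "bird", "dog"]
y = LabelEncoder().fit_transform(labels)  # integer ids, e.g. [2 1 0 2]
X = np.random.rand(4, 32)                 # placeholder feature batch

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(3, activation='softmax'),  # one output per class
])
# Integer labels go straight in; no one-hot encoding needed
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=1, verbose=0)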

More info is available on the Keras website.

Mo Abbasi