Ok, so here is how it works. Say you want to classify animals, and you have cats, dogs and birds. This means that your model will output 3 values in the form of a vector (call it a list if you prefer).
Each element of the list represents an animal, so for example:
- Position 0 represents how likely the input is to be a cat
- Position 1 represents how likely the input is to be a dog
- Position 2 represents how likely the input is to be a bird.
Now imagine you get an input that is a bird. In an ideal world, your algorithm should output a vector like
[0, 0, 1]
That is, the input has a 0% chance of being a cat, a 0% chance of being a dog and a 100% chance of being a bird.
In reality, this is not so simple, and most likely your output would be something like this:
[0.15, 0.1, 0.75]
This means a 15% chance of being a cat, a 10% chance of being a dog and a 75% chance of being a bird. Now, notice that your algorithm will still classify the input as a bird, since the bird class gets the highest probability, so in terms of classification the output would be correct... but it would not be as correct as if it had predicted a 100% chance of a bird.
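To make that concrete, here is a tiny Python sketch (the `labels` list is just for illustration) showing that the predicted class is simply the position with the highest probability:

```python
import numpy as np

labels = ["cat", "dog", "bird"]
probs = np.array([0.15, 0.10, 0.75])  # the model's output vector

# argmax returns the position of the largest value, here 2 (bird)
print(labels[np.argmax(probs)])  # bird
```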
So, the intuition is that the log loss measures how far away you are from perfection, where perfection would be assigning a 100% chance to the correct label and a 0% chance to the incorrect labels.
Final word of advice: do NOT be afraid of the math, you will really need to get a grasp of it at some point. Do not let the sum terms intimidate you; after all, they just represent loops in programming.
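For instance, a sum such as $\sum\limits_{i=1}^n x_i$ is nothing more than this (a toy sketch with made-up values):

```python
# sum_{i=1}^{n} x_i written as a plain loop
x = [0.5, 1.5, 2.0]
total = 0.0
for x_i in x:
    total += x_i
print(total)  # 4.0
```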
UPDATE
Let's dive into the math, especially to demystify it.
The log loss formula for a binary (two-class) problem is given by
$$ LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i) ] $$
For a multiclass problem like ours, with $K$ classes, it generalizes to
$$ LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \cdot \log(\hat{y}_{ik}) $$
Where
- $n$ represents the number of examples; in our case I will use 2 examples.
- $K$ represents the number of classes; here $K = 3$ (cat, dog, bird).
- $y_{ik}$ is 1 if the correct answer for example $i$ is class $k$, and 0 otherwise, so each $y_i$ is a one-hot vector like the ones above.
- $\hat{y}_{ik}$ represents our predicted probability that example $i$ belongs to class $k$.
So, in our case we have two examples (remember that the correct answers are the one-hot vectors $y_i$), which are
[0, 1, 0] # Example 1: This means the correct answer is dog
[1, 0, 0] # Example 2: This means the correct answer is cat
Now, let's go for our predictions. Remember that the predictions are denoted by $\hat{y}$; let's say they are
[0.1, 0.6, 0.3]
[0.85, 0.05, 0.1]
And let's apply the scary formula here. First notice that $\sum\limits_{i=1}^n$ just means sum all the terms from $i=1$ to $n$; in our case, $n$ is the number of examples, and we have two of those.
So, for each example $i$ we need the inner term
$$ \sum\limits_{k=1}^K y_{ik} \cdot \log(\hat{y}_{ik}) $$
For $i=1$
term1 = sum([0, 1, 0] * log([0.1, 0.6, 0.3])) = log(0.6)
For $i=2$
term2 = sum([1, 0, 0] * log([0.85, 0.05, 0.1])) = log(0.85)
Notice how multiplying by the one-hot vector simply picks out the log of the probability assigned to the correct class. And finally we have
log_loss = (-1/2) * (term1 + term2) = (-1/2) * (log(0.6) + log(0.85)) ≈ 0.3366
Using sklearn's log_loss, the answer is indeed 0.3366.
Now, do not get too lost in the math here, just notice that it is not THAT hard, and also understand that the loss function essentially tells you how wrong you are, or if you prefer, it measures the "distance" from perfection. I strongly recommend you to code the log loss by yourself (numpy is normally a good option to do so :), as in the sketch below.
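To make that recommendation concrete, here is a minimal numpy sketch (the function and variable names are my own, not from any library); it reproduces the hand calculation above, and the commented-out lines show how to cross-check it against sklearn:

```python
import numpy as np

def multiclass_log_loss(y_true, y_pred):
    """Multiclass log loss: y_true is one-hot encoded, y_pred holds probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # For each example, the one-hot row picks out log(p) of the correct class
    per_example = np.sum(y_true * np.log(y_pred), axis=1)
    return -np.mean(per_example)

y_true = [[0, 1, 0],              # Example 1: dog
          [1, 0, 0]]              # Example 2: cat
y_pred = [[0.10, 0.60, 0.30],     # predictions for example 1
          [0.85, 0.05, 0.10]]     # predictions for example 2

print(multiclass_log_loss(y_true, y_pred))  # ~0.3366

# Cross-check with scikit-learn (if installed):
# from sklearn.metrics import log_loss
# print(log_loss(y_true, y_pred))  # ~0.3366
```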