I am having difficulty understanding the expected squared error formula on this website:
$y=f(x)+e$ (the true regression model)
$\hat{y}=\hat{f}(x)$ (your estimated regression line)
$error(x)=\bigg(\mathbb{E}[\hat{f}(x)]-f(x)\bigg)^2+\mathbb{E}\bigg[\big(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\big)^2\bigg]+\operatorname{Var}(e)=\text{bias}^2+\text{variance}+\text{irreducible error}$
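For context, here is how I reconstruct where this decomposition comes from: add and subtract $\mathbb{E}[\hat{f}(x)]$ inside the squared error, and use that $e$ has mean zero and is independent of $\hat{f}$, so the cross terms vanish:

$$\begin{aligned}
\mathbb{E}\big[(y-\hat{f}(x))^2\big]
&=\mathbb{E}\big[(f(x)+e-\hat{f}(x))^2\big]\\
&=\mathbb{E}\big[(f(x)-\hat{f}(x))^2\big]+\operatorname{Var}(e)\\
&=\big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2+\mathbb{E}\Big[\big(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\big)^2\Big]+\operatorname{Var}(e).
\end{aligned}$$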
My question is: what is the best way to understand bias and variance from this equation?
My understanding of the variance term is this:
Assume you collect a sample of 1000 records and fit a regression line to it, obtaining the model $\hat{f}(x)=1+2x$.
So at a given $x$ value, for example $x=1$, you get $\hat{f}(1)=1+2\cdot 1=3$.
Then you re-sample another 1000 records and fit a new regression line:
$\hat{f}(x)=2+3x$
so when $x=1$, $\hat{f}(1)=5$.
You repeat this process $N$ times (assume $N$ is a large number, close to infinity).
Then $\mathbb{E}[\hat{f}(1)]$ is approximately $(3+5+\dots)/N$, the average of the $N$ predictions.
Assume your $\mathbb{E}[\hat{f}(1)]$ is $5.5$.
Once you have estimated $\mathbb{E}[\hat{f}(x)]$ at $x=1$, you should be able to compute $\mathbb{E}\bigg[\big(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\big)^2\bigg]$ as
$((3-5.5)^2+(5-5.5)^2+\dots)/N$
My interpretation above focuses on the variance of $\hat{f}(x)$ at each given $x$-point (re-sampling to get a new $\hat{f}(x)$ each time).
But my instinct tells me I should calculate everything from only one data set (1000 records) and compute $\mathbb{E}[\hat{f}(x)]$ across different $x$ points.
Please let me know which interpretation is correct; if both are wrong, please explain, using this formula, how I should interpret its terms.