
I recently wrote this as a list of problem sources I have seen or can think of that make it hard to reproduce (replicate?) an experiment. I think I have seen most of them, except for hardware errors.

My thought was this: a typical computer vision experiment might train on a GPU for many hours or even several days. Even if a bit flip happened only once every $10^9$ floating-point operations, that would be about 11,500 errors per second on an Nvidia GTX 1080 Ti (roughly $1.15 \cdot 10^{13}$ FLOPS peak). I don't know how such an error would affect later calculations (that is, how well the problem is numerically conditioned).
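For reference, here is the back-of-the-envelope arithmetic behind that number as a minimal sketch. Both constants are assumptions: the peak throughput of roughly 11.5 TFLOPS is what the estimate implies, and the error rate is the hypothetical one from above, not a measured value.

```python
# Assumption: peak throughput implied by the estimate above (not measured).
PEAK_FLOPS_PER_SECOND = 11.5e12
# Assumption: hypothetical rate of one bit flip per 10^9 operations.
OPS_PER_BIT_FLIP = 1e9


def expected_errors_per_second(flops_per_second, ops_per_error):
    """Expected number of bit flips per second at full utilization."""
    return flops_per_second / ops_per_error


def expected_errors(flops_per_second, ops_per_error, hours):
    """Expected number of bit flips over a training run of the given length."""
    per_second = expected_errors_per_second(flops_per_second, ops_per_error)
    return per_second * hours * 3600


errors_per_second = expected_errors_per_second(PEAK_FLOPS_PER_SECOND, OPS_PER_BIT_FLIP)
errors_per_day = expected_errors(PEAK_FLOPS_PER_SECOND, OPS_PER_BIT_FLIP, hours=24)
print(errors_per_second, errors_per_day)
```

Real error rates are presumably many orders of magnitude lower than this hypothetical one, so the sketch only shows how the estimate scales with throughput and run time.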

So: are there any reports of hardware errors affecting experiments?

(Blog posts, journal articles, posters?)

Martin Thoma

1 Answer


Memory corruption seems to be a serious enough issue that companies buy expensive ECC memory for their clusters. The Wikipedia article on ECC memory lists some causes of memory corruption, including (to my surprise and delight) cosmic rays and ingenious hackers.
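How much a single flipped bit matters depends heavily on which bit is hit. Here is a minimal sketch, assuming the value is stored as an IEEE 754 float32; `flip_bit` is a hypothetical helper written for illustration, not part of any library.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the float32 encoding of value."""
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Toggle the chosen bit and reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


low = flip_bit(1.0, 0)    # least significant mantissa bit
high = flip_bit(1.0, 30)  # most significant exponent bit
sign = flip_bit(1.0, 31)  # sign bit
print(low, high, sign)
```

For the value 1.0, flipping the lowest mantissa bit changes it by only $2^{-23}$, flipping the top exponent bit turns it into infinity, and flipping the sign bit gives -1.0. So a corrupted weight or activation could be anything from invisible to catastrophic.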

Elias Strehle
  • Just because there is a product doesn't mean there is a need. See [Homeopathy](https://en.wikipedia.org/wiki/Homeopathy), [Horoscopes](https://en.wikipedia.org/wiki/Horoscope) and many more. – Martin Thoma Feb 06 '18 at 05:29