Dealing with zeros when plotting log-scaled data

Question

I have a non-negative variable and I'd like to plot it on a log scale.

I'm trying to understand how to deal with 0-values. One naive idea I had in mind is just to add 1 to all values (or some very small number greater than 0).

What other options are available?

Thanks

WBM · Accepted Answer · 2021-02-24T15:55:03.587

2

Your suggestion is a valid one: it encodes the zeros as a value whose outcome is known once the scaling is applied. log(1) is zero, so keep that in mind for your next stage. You can use clip or replace for this:

df.clip(lower=1)

or try replacing the zeros with NaN:

df.replace(0, np.nan)
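A minimal sketch of how these two options behave once the log is taken. The DataFrame, the column name `value`, and the numbers are made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column name and values are illustrative only.
df = pd.DataFrame({"value": [0.0, 1.0, 10.0, 100.0]})

# Option 1: raise everything below 1 up to 1, so zeros map to log10(1) = 0.
# Note that this also changes any values strictly between 0 and 1.
clipped_log = np.log10(df["value"].clip(lower=1))

# Option 2: turn zeros into NaN so they are left out rather than plotted.
nan_log = np.log10(df["value"].replace(0, np.nan))
```

With clipping, the zero and the 1 end up at the same position on the log axis. With NaN, the zero simply does not appear in the plot.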

Alternatively you could do one of the following:

Drop the zero-value rows, e.g. df = df[df['column'] != 0], but then you lose some data.
Fill the zero values with a statistically representative value (i.e. interpolation). The pandas interpolate method is one way to do this.

Whichever method you choose depends on your use case and on compatibility with the plotting function you use.
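The dropping and interpolating alternatives could be sketched as follows. The Series and its values are hypothetical, and linear interpolation is only one of several methods pandas offers:

```python
import numpy as np
import pandas as pd

# Hypothetical series; the values are made up for illustration.
s = pd.Series([1.0, 0.0, 100.0, 1000.0])

# Drop the zero rows: the log is defined everywhere, but a data point is lost.
dropped_log = np.log10(s[s != 0])

# Treat zeros as missing and fill them by linear interpolation between
# neighbours. This only makes sense if a zero really means "no measurement".
filled = s.replace(0, np.nan).interpolate()
filled_log = np.log10(filled)
```

Here the zero is replaced by 50.5, the midpoint of its two neighbours, so the original measurement is overwritten rather than kept.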

edited Feb 24 '21 at 15:55

answered Feb 24 '21 at 15:36


The zero values in the original data are completely valid, useful, and meaningful data. I can't think of a reason why you'd want to just throw those data away when log scaling, as it will introduce significant bias in your log-scaled data. There also isn't any reason to interpolate new values, as you're again just throwing out valid data and overwriting it. Being unable to log-scale a raw value of zero doesn't mean that the raw value ought to be something else, which is what you'd accomplish by interpolating. Interpolation would act as if you didn't have any raw measurement *at all*. – Nuclear Hoagie Feb 24 '21 at 16:10
That's true, which is why I said it depends on the use case. Zeros often stand in for missing data, and in that situation an interpolation method might be more useful. – WBM Feb 24 '21 at 16:17

Nuclear Hoagie · Answer 2 · 2021-02-24T16:12:36.340

For this type of issue, I typically add the reciprocal of the log base. For data that's being log10-scaled, this means adding 0.1 to all values; for data that's being log2-scaled, it means adding 0.5. This has the nice property of mapping all of your 0 values to -1 on the log scale, regardless of which log base you use, since log_b(1/b) = -1. If your data are numerically very small, you may want to use a higher power of the reciprocal so that the added offset doesn't shift your actual values by several fold. If the data are all between 0 and 0.01, for example, I might add an offset of 0.0001 when log10 scaling.
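As a rough sketch of this idea: the helper name `log_with_offset` and its `power` parameter are hypothetical, introduced here only to illustrate the approach.

```python
import numpy as np

def log_with_offset(values, base=10.0, power=1):
    """Add base**(-power) to every value, then take the log in that base.

    Zeros end up at exactly -power on the log scale.
    """
    values = np.asarray(values, dtype=float)
    offset = base ** (-power)  # 0.1 for base 10, 0.5 for base 2 when power=1
    return np.log(values + offset) / np.log(base)

zero_log10 = log_with_offset([0.0], base=10)           # zero maps to -1
zero_log2 = log_with_offset([0.0], base=2)             # zero maps to -1
zero_small = log_with_offset([0.0], base=10, power=4)  # offset 0.0001, zero maps to -4
```

For data between 0 and 0.01, `power=4` gives the 0.0001 offset mentioned above, and zeros then sit at -4 instead of -1.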
