
I am using pandas and am dealing with time series of sales.

What I would like to do is remove the columns in which a certain number of consecutive zeros appears, since forecasting for sparse series or repeated zero values tends to be unreliable. My dataframe has a bit more than 4,400 time series over a time span of 5 years.

I have tried applying a strategy from a previous SO question, but I cannot drop the columns that fulfill the condition.

Below is an example where the threshold would be 12. Any column having 12 or more consecutive zeros would be discarded.

Input:

      A   B   C

0     1   1   8
1     2   0  14
2     3   0  20
3     0   0  15
4     0   0  23
5     0   0  25
6     0   0  22
7     0   0  18
8     0   0  16
9     0   0  14
10    0   0   8
11    1   0  10
12   18   0   7
13   34   0   4
14  110   1  14
15   11  30   2
16    0  24   8
17    0  11   7
18    1  22  11
19    0  33   3
20    0  90  12
21   12  32  19
22   11  90  17
23   77  13   2

Desired output:

      A   C

0     1   8
1     2  14
2     3  20
3     0  15
4     0  23
5     0  25
6     0  22
7     0  18
8     0  16
9     0  14
10    0   8
11    1  10
12   18   7
13   34   4
14  110  14
15   11   2
16    0   8
17    0   7
18    1  11
19    0   3
20    0  12
21   12  19
22   11  17
23   77   2
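
For reference, the input frame can be reproduced with something like the following (values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 18, 34, 110, 11, 0, 0, 1, 0, 0, 12, 11, 77],
    "B": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 30, 24, 11, 22, 33, 90, 32, 90, 13],
    "C": [8, 14, 20, 15, 23, 25, 22, 18, 16, 14, 8, 10, 7, 4, 14, 2, 8, 7, 11, 3, 12, 19, 17, 2],
})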

Thanks.

Nathan Furnal

4 Answers

max0 returns the length of the longest run of consecutive zeros in a series:

import pandas as pd

def max0(sr):
    # Every nonzero value increments the cumulative count, so a run of
    # consecutive zeros shows up as one repeated label in the cumsum.
    counts = (sr != 0).cumsum().value_counts()
    # The biggest group is the longest run; subtract its leading nonzero
    # element unless the run starts the series (label 0).
    return counts.max() - (0 if counts.idxmax() == 0 else 1)

max0(pd.Series([0, 0, 0, 0]))
Output:
> 4

max0(pd.Series([1, 0, 0, 0, 0, 2, 3]))
Output:
> 4

max0(pd.Series([0]))
Output:
> 1

max0(pd.Series([1]))
Output:
> 0
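
One possible way to use this helper to drop the offending columns (a sketch, assuming the example frame df and the threshold of 12 from the question):

threshold = 12

# keep only the columns whose longest run of zeros stays below the threshold
df = df[[col for col in df.columns if max0(df[col]) < threshold]]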

fuwiak

You can try rolling().sum(): over the zero mask, a rolling window of length thresh sums to thresh exactly when every value in the window is zero, i.e. when the column has thresh consecutive zeros.

thresh = 12

# columns where some window of `thresh` consecutive rows is entirely zero
to_drop = df.eq(0).rolling(thresh).sum().eq(thresh).any()
df.loc[:, ~to_drop]

Output:

      A   C
0     1   8
1     2  14
2     3  20
3     0  15
4     0  23
5     0  25
6     0  22
7     0  18
8     0  16
9     0  14
10    0   8
11    1  10
12   18   7
13   34   4
14  110  14
15   11   2
16    0   8
17    0   7
18    1  11
19    0   3
20    0  12
21   12  19
22   11  17
23   77   2
Quang Hoang

An alternative to achieve what you are looking for:

threshold = 12

# Each nonzero value starts a new group, so within a group cumcount()
# numbers the consecutive zeros that follow it; its maximum per column is
# the length of the longest zero run.
drops = [
    col for col in df.columns
    if df.groupby((df[col] != 0).cumsum()).cumcount().max() >= threshold
]

df.drop(drops, axis=1)

Output:

      A   C
0     1   8
1     2  14
2     3  20
3     0  15
4     0  23
5     0  25
6     0  22
7     0  18
8     0  16
9     0  14
10    0   8
11    1  10
12   18   7
13   34   4
14  110  14
15   11   2
16    0   8
17    0   7
18    1  11
19    0   3
20    0  12
21   12  19
22   11  17
23   77   2
Multivac

Short answer:

# Max number of zeros in a row
threshold = 12

# 1. Transform the column into a boolean is_zero mask
# 2. Group the rows by the running count of nonzero values, so that each
#    run of consecutive zeros falls into a single group
# 3. Sum the mask within each group to get the length of every zero run
# 4. Drop the columns whose longest run reaches the threshold

to_remove = []
for col in df.columns:
    is_zero = df[col] == 0
    run_lengths = is_zero.groupby((~is_zero).cumsum()).sum()
    if run_lengths.max() >= threshold:
        to_remove.append(col)

df = df.drop(to_remove, axis=1)

You're welcome

qmeeus