
I have a df with many columns that represent the market cap of companies that compose an index. The index of the dataframe is dates. Before the company enters the index or after it leaves it, the market cap of the company in the df is zero.

I want to know the mean number of days in which the index doesn't change.

for example:

[in]: 
df = pd.DataFrame(np.array([[1, 1,np.nan], [np.nan,2, 10], [1,3, 100],[4,np.nan, 100]]), 
                           columns=['a', 'b','c'])

df
[out]:  
        a   b   c    
    0   1.0 1.0 NaN    
    1   NaN 2.0 10.0    
    2   1.0 3.0 100.0    
    3   4.0 NaN 100.0

What I want to know is how many rows have entries from the same set of columns.

For example, row 0 has entries from columns a and b, row 1 from columns b and c, row 2 from columns a, b and c, and row 3 from columns a and c. Therefore there are 4 rows with unique column combinations and 4 changes, so the mean is 1.

bbarroso
  • I have not come across a function in Pandas that can do what you are asking. But what you could do is generate a hash code from the presence of values in the different columns and store it as another column. Then you could find the rows that have a unique hash code. – Nischal Hp Feb 06 '18 at 13:57

2 Answers


By building a tuple for each row that records whether each value is NaN or not, you can collect the unique column combinations in a set, like:

Code:

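import numpy as np
import pandas as pd

def get_average_run_length(a_df):
    # One tuple of "value is present" flags per row; the set keeps distinct ones.
    unique_run_lengths = set(
        tuple(not np.isnan(x) for x in row[1]) for row in a_df.iterrows())
    return len(a_df) / len(unique_run_lengths)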

Test Code:

df = pd.DataFrame(
    [[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]],
    columns=list('abc'))

print(df)
print(get_average_run_length(df))

# double the last row
df = pd.concat([df, df.tail(1)])
print(df)
print(get_average_run_length(df))

Results:

     a    b      c
0  1.0  1.0    NaN
1  NaN  2.0   10.0
2  1.0  3.0  100.0
3  4.0  NaN  100.0

1.0

     a    b      c
0  1.0  1.0    NaN
1  NaN  2.0   10.0
2  1.0  3.0  100.0
3  4.0  NaN  100.0
3  4.0  NaN  100.0

1.25
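
As a possible alternative to the row loop, the same ratio can be sketched without `iterrows` by de-duplicating the boolean presence mask. The function name `get_average_run_length_vectorized` is made up for this illustration, and the sketch assumes that absent companies are stored as NaN, as in the example frame. If they are stored as zero, as the question's text describes, the mask would have to be built differently, for example with `a_df != 0`.

```python
import numpy as np
import pandas as pd


def get_average_run_length_vectorized(a_df):
    # Hypothetical helper name, used only for this sketch.
    # Boolean mask: True where a company has a market cap on that date.
    present = a_df.notna()
    # Each distinct row of the mask is one distinct index composition.
    n_unique = len(present.drop_duplicates())
    return len(a_df) / n_unique


df = pd.DataFrame(
    [[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]],
    columns=list('abc'))

print(get_average_run_length_vectorized(df))
# Same frame with the last row doubled.
print(get_average_run_length_vectorized(pd.concat([df, df.tail(1)])))
```

On the two test frames above this should give the same values as the loop version, 1.0 and 1.25.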
Stephen Rauch

I am not entirely sure if I understand your question correctly. What result do you want for the following dataframe: 2/5 because there are 2 different NaN-combinations in 5 rows, or 4/5 because there are 4 changes in the index composition in 5 rows?

    a   b
0   1   NaN
1   NaN 2
2   3   NaN
3   NaN 4
4   5   NaN

Stephen Rauch's answer covers the first scenario. This one should cover the second:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, np.nan],
                            [3, 4, np.nan],
                            [6, np.nan, 7],
                            [8, np.nan, np.nan]]))

df_na = np.array(df.isna()[:-1])
df_na_shifted = np.array(df.isna()[1:])

number_of_good_cols = (df_na == df_na_shifted).sum(axis=1)

number_of_good_rows = sum(number_of_good_cols == df.shape[1])
print('The index changed on %d days.' % (df.shape[0] - 1 - number_of_good_rows))
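
To get from the number of changes to a mean number of days, one option is to divide the number of rows by the number of unchanged stretches. The sketch below is only one reading of the question: the name `mean_days_between_changes` is made up here, and counting stretches as changes + 1 is an assumption, not something the question states.

```python
import numpy as np
import pandas as pd


def mean_days_between_changes(a_df):
    # Hypothetical helper name, used only for this sketch.
    present = a_df.notna().to_numpy()
    # True for each row whose composition differs from the previous row.
    changed = (present[1:] != present[:-1]).any(axis=1)
    # Assumption: n changes split the rows into n + 1 unchanged stretches.
    n_runs = int(changed.sum()) + 1
    return len(a_df) / n_runs


# The two-column frame from the start of this answer: every row is a change.
df_ab = pd.DataFrame({'a': [1, np.nan, 3, np.nan, 5],
                      'b': [np.nan, 2, np.nan, 4, np.nan]})
print(mean_days_between_changes(df_ab))  # 4 changes, 5 stretches, 5 / 5 = 1.0

# The three-column frame used above: 2 changes, 3 stretches, 4 rows.
df_3 = pd.DataFrame(np.array([[1, 2, np.nan],
                              [3, 4, np.nan],
                              [6, np.nan, 7],
                              [8, np.nan, np.nan]]))
print(mean_days_between_changes(df_3))
```

Unlike the set-based count, this treats a composition that comes back later as a new stretch, which is why the alternating a/b frame gives 1.0 here rather than 5 / 2.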
Elias Strehle