10

Assume that we have the following pandas dataframe:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

input:
 col1 col2  start
0  A>G  TCT   1000
1  C>T  ACA   2000
2  C>T  TCA   3000
3  G>T  TCA   4000
4  C>T  GCT   5000
5  A>G  ACT   6000
6  A>G  CTG  10000
7  A>G  ATG  20000
8  C>A  TCT  10000
9  C>T  ACA   2000
10 C>T  TCA   3000
11 C>T  TCA   4000

What I want to get is the number of consecutive values in col1 and length of these consecutive values and the difference between the last element's start and first element's start:

output:
 type length  diff
0  C>T  2   1000
1  A>G  3   14000
2  C>T  3   2000
burcak
  • 203
  • 1
  • 2
  • 4

1 Answers1

12

Break col1 into sub-groups of consecutive strings. Extract first and last entry per sub-group.

Something like this:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

df['subgroup'] = (df['col1'] != df['col1'].shift(1)).cumsum()


  col1 col2  start  subgroup
0  A>G  TCT   1000         1
1  C>T  ACA   2000         2
2  C>T  TCA   3000         2
3  G>T  TCA   4000         3
4  C>T  GCT   5000         4
5  A>G  ACT   6000         5
6  A>G  CTG  10000         5
7  A>G  ATG  20000         5


df.groupby('subgroup',as_index=False).apply(lambda x: (x['col1'].head(1),
                                                        x.shape[0],
                                                        x['start'].iloc[-1] - x['start'].iloc[0]))

0        ([A>G], 1, 0)
1     ([C>T], 2, 1000)
2        ([G>T], 1, 0)
3        ([C>T], 1, 0)
4    ([A>G], 3, 14000)

Tweak as needed.

UPDATE: for pandas 1.1+ replace the last part with:

def func(x):
    result = {"type":x['col1'].head(1).values[0], "length": x.shape[0], "diff": x['start'].iloc[-1] - x['start'].iloc[0]}
    return pd.Series(result, name="index")
df.groupby('subgroup',as_index=False).apply(func)
oW_
  • 6,254
  • 4
  • 28
  • 45
  • Thanks for your solution. shift(1) shifts the column one row down but how does the cumsum works after this point? I did not understand that part. Could you please explain? Thanks. – burcak Nov 19 '18 at 22:26
  • The first part is `True` whenever a new subgroup starts, `.cumsum()` simply adds (or counts) all the `True` values up to this point (`True` and `False` can be treated as `1` and `0`). Since there is exactly one `True` per subgroup this gives the result. – oW_ Nov 19 '18 at 22:44
  • Thanks @oW_ How do we return a dataframe and other columns of df that are after start column? – burcak Nov 19 '18 at 23:10
  • you can add them to the tuple and unpack it. if you need it in a different way I suggest asking a question on stackoverflow. this is not really data science related. – oW_ Nov 19 '18 at 23:27
  • Can use `.ne()` method instead of `!=` – smci Jan 07 '20 at 07:24
  • 2
    @oW_ I don't believe this works anymore, at least in my environment. I get an error `ValueError: 1 columns passed, passed data had 3 columns`. – d84_n1nj4 Dec 07 '20 at 15:18
  • Thanks for pointing that out. I think the update should work in the newer pandas versions as well. – oW_ May 17 '22 at 15:51