vendredi 9 février 2018

Pandas - Find and index rows that match row sequence pattern

I would like to find a pattern in a dataframe in a categorical variable going down rows. I can see how to use Series.shift() to look up / down and using boolean logic to find the pattern, however, I want to do this with a grouping variable and also label all rows that are part of the pattern, not just the starting row.

Code:

import pandas as pd
from numpy.random import choice, randn
import string

# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                   'group_var': choice(list(string.ascii_uppercase), n_rows),
                   'row_pat': choice([0, 1, 2, 3], n_rows),
                   'values': randn(n_rows)})

# sorting 
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)

Which returns this: enter image description here

I can find the start of the pattern (with no grouping though) by this:

# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0 

# flag the row at the start of the pattern
df['pat_flag'] = \
df['row_pat'].eq(p0) & \
df['row_pat'].shift(-1).eq(p1) & \
df['row_pat'].shift(-2).eq(p2) & \
df['row_pat'].shift(-3).eq(p3)

df.head(10)

enter image description here

What i cant figure out, is how to do this only withing the "group_var", and instead of returning True for the start of the pattern, return true for all rows that are part of the pattern.

Appreciate any tips on how to solve this!

Thanks...

Aucun commentaire:

Enregistrer un commentaire