dimanche 8 mars 2020

Extract rows based on patterns in data frame

I am learning Python and was working on one of the dataset like this:

**Col1**                                 **Col2**      **Col3**      **Col4**  
dog                                        Z             st02          0
dog,cat                                    Z             st02          1
dog,bat,cat                                Z             st02          2
bat,cat,elephant                           Y             st02          2
dog,bat,cat,elephant                       Y             st02          3
tiger                                      Z             st01          0
lion,leopard,cheetah                       Z             st01          2
tiger,lion,leopard,cheetah                 Z             st01          3
dog,tiger,cheetah                          Y             st01          2
dog,tiger,leopard,cheetah                  Y             st01          3
eagle,jaguar,Kangaroo,zebra                Z             st02          3
cheetah,eagle,jaguar,Kangaroo,zebra        Z             st02          4

The expected output is:

**Col1**                                 **Col2**       **Col3**      **Col4**
dog,bat,cat                                Z              st02          2
dog,bat,cat,elephant                       Y              st02          3
tiger,lion,leopard,cheetah                 Z              st01          3
dog,tiger,leopard,cheetah                  Y              st01          3
cheetah,eagle,jaguar,Kangaroo,zebra        Z              st02          4

In order to extract the above rows as output, I tried tracing the patterns and using the below logic:

data = pd.read_excel("data.xlsx")
data['Col4'] = data['Col1'].str.count(',')
v1 = []
v2 = []
v1.append(0)
v2.append(0)
for i in range(0,data.shape[0]-1):
    x = data['Col_2'][i]
    y = data['Col_2'][i+1]
    t1 = data['Col_3'][i]
    t2 = data['Col_3'][i+1]
    g1 = (x == y) & (t1==t2)
    d1 = data['Col_1'][i]
    d2 = data['Col_1'][i+1]
    c1 = data['Col_4'][i]
    c2 = data['Col_4'][i+1]
    flag = 0
    if(all(x in d2 for x in d1)):
      flag = 1
    g2 = (flag == 1)&(c2>c1)
    v1.append(g1)
    v2.append(g2)
    data['new_cond1'] = v1   
    data['new_cond2'] = v2   
    data['Final_flag'] = (data['new_cond1']==True)&(data['new_cond2']==True) 
    data_output = data[data['Final_flag']==True]  

But I didn't end up getting the expected output, rather few additional rows are also present in output. Could someone please help me extracting the rows mentioned in expected output.

Thanks in advance!

Aucun commentaire:

Enregistrer un commentaire