dimanche 8 mars 2020

Exploring and finding patterns/similarity in Data with Python

as I was learning Python,was working on one of the dataset like this:

**Col1**                                 **Col2**      **Col3**        
dog                                        Z             st02          
dog,cat                                    Z             st02          
dog,bat,cat                                Z             st02          
bat,cat,elephant                           Y             st02          
dog,bat,cat,elephant                       Y             st02          
tiger                                      Z             st01          
pigeon                                     Z             st01          
pigeon,parrot                              Z             st01          
dove,parrot                                Z             st01          
pigeon,parrot                              Z             st01          
pigeon,parrot,dove                         Z             st01          
lion,leopard,cheetah                       Z             st01          
tiger,lion,leopard,cheetah                 Z             st01          
dog,tiger,cheetah                          Y             st01          
dog,tiger,leopard,cheetah                  Y             st01          
eagle,jaguar,Kangaroo,zebra                Z             st02          
cheetah,eagle,jaguar,Kangaroo,zebra        Z             st02          

The expected output is:

**Col1**                                 **Col2**       **Col3**      
dog,bat,cat                                Z              st02          
dog,bat,cat,elephant                       Y              st02          
tiger,lion,leopard,cheetah                 Z              st01          
dog,tiger,leopard,cheetah                  Y              st01          
cheetah,eagle,jaguar,Kangaroo,zebra        Z              st02          
pigeon,parrot,dove                         Z              st01          

In order to extract the above rows as output, I tried tracing the patterns and using the below logic:

data = pd.read_excel("data.xlsx")
data['Col4'] = data['Col1'].str.count(',')
v1 = []
v2 = []
v1.append(0)
v2.append(0)
for i in range(0,data.shape[0]-1):
    x = data['Col_2'][i]
    y = data['Col_2'][i+1]
    t1 = data['Col_3'][i]
    t2 = data['Col_3'][i+1]
    g1 = (x == y) & (t1==t2)
    d1 = data['Col_1'][i]
    d2 = data['Col_1'][i+1]
    c1 = data['Col_4'][i]
    c2 = data['Col_4'][i+1]
    flag = 0
    if(all(x in d2 for x in d1)):
      flag = 1
    g2 = (flag == 1)&(c2>c1)
    v1.append(g1)
    v2.append(g2)
    data['new_cond1'] = v1   
    data['new_cond2'] = v2   
    data['Final_flag'] = (data['new_cond1']==True)&(data['new_cond2']==True) 
    data_output = data[data['Final_flag']==True]  

But I didn't end up getting the expected output, rather few additional rows are also present in output. Could someone please help me extracting the rows mentioned in expected output.

Thanks in advance!

Aucun commentaire:

Enregistrer un commentaire