Thursday, 24 August 2017

Divide a list of sublists based on a pattern in sublist[0]

I have two parallel inputs, a list of ints and a sparse matrix: list_index = [1,1,2,3,3,4,4,5] and matrix_user = [sparse1, sparse2, sparse3, sparse4, sparse5, sparse6]

I want a list of sublists, each made of a list of ints and a sparse matrix: [ [[1,1,2,3,3],[sparse1, sparse2, sparse3, sparse4]] , [[4,4,5],[sparse5, sparse6]] , ......] of length ~90 (to run in parallel later on), with no value appearing in more than one sublist[0].
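The non-overlap requirement can be written down as a small check. This is only a sketch: `chunks_are_disjoint` is a hypothetical helper name, and it assumes each chunk is a two-element `[index_values, matrix]` pair as described above.

```python
def chunks_are_disjoint(chunks):
    """Return True if no index value appears in more than one chunk's sublist[0]."""
    seen = set()
    for index_values, _matrix in chunks:
        values = set(index_values)
        if seen & values:
            # at least one value was already used by an earlier chunk
            return False
        seen |= values
    return True
```

The matrix part of each chunk is ignored here, so the check can be run on the index lists alone before any parallel work starts.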

To cut the two inputs into 90 sections I do the following:

        # Cut the data into chunks to run in parallel.
        list_index = dfuser['idx'].tolist()
        # Convert to CSR once, so that row slicing below is cheap.
        matrix_user = encoder.fit_transform(dfuser[['col1','col2']].values).tocsr()
        sizechunk = 90  # number of chunks
        sizelist = int(len(list_index)/sizechunk)  # rows per chunk
        if len(list_index)%sizechunk!=0 : sizelist += 1

        list_all = []
        for i in range(sizechunk) :
            if i*sizelist >= len(list_index) : break  # no rows left, avoid an empty chunk
            if (i+1)*sizelist < len(list_index) : list_all.append(  [list_index[i*sizelist:(i+1)*sizelist] , matrix_user[i*sizelist:(i+1)*sizelist] ]  )
            else : list_all.append( [list_index[i*sizelist:] , matrix_user[i*sizelist:] ])

This gives me a list of 90 chunks: [ [[1,1,2,3],[sparse1, sparse2, sparse3]] , [[3, 4,4,5],[sparse4, sparse5, sparse6]] , ......]
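The way a fixed-size cut separates equal values can be reproduced on the toy list alone. A minimal sketch, where `fixed_size_chunks` is a hypothetical name and only the index list is cut:

```python
def fixed_size_chunks(values, n_chunks):
    """Cut values into pieces of equal length (the last piece may be shorter)."""
    # ceiling division, with a floor of 1 so an empty input does not give a zero step
    size = max(1, -(-len(values) // n_chunks))
    return [values[i:i + size] for i in range(0, len(values), size)]


print(fixed_size_chunks([1, 1, 2, 3, 3, 4, 4, 5], 2))
# [[1, 1, 2, 3], [3, 4, 4, 5]]
```

The value 3 ends the first piece and starts the second one, which is the overlap that has to be repaired afterwards.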

Then I adjust the chunk boundaries so that no index value appears in two different sublists:

        import scipy.sparse as sp

        i=0
        size_list = len(list_all)
        while i<size_list-1 :
            last_elem = list_all[i][0][-1]
            first_elem = list_all[i+1][0][0]
            while first_elem==last_elem :
                # move the first row of the next chunk to the end of this one
                first_sparse = list_all[i+1][1][0]
                list_all[i][0].append(first_elem)
                list_all[i][1] = sp.vstack((list_all[i][1],first_sparse), format='csr')
                list_all[i+1][0] = list_all[i+1][0][1:]
                list_all[i+1][1] = list_all[i+1][1][1:]
                if len(list_all[i+1][0])==0 :
                    del list_all[i+1]  # the next chunk is now empty
                    size_list -= 1
                    if i+1==size_list : break
                first_elem = list_all[i+1][0][0]
            i +=1

It works, but with about 18 million entries it takes 6 hours.
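One likely cause of the run time is that every moved row triggers an `sp.vstack` and two `[1:]` slices, each of which copies a whole chunk. A possible alternative is to compute the cut positions first and slice each input only once. The following is a sketch under assumptions: equal values in list_index are contiguous (as in the example), the matrix has one row per entry of list_index, and `run_aligned_bounds` and `split_by_bounds` are hypothetical names, not existing library functions.

```python
from bisect import bisect_left


def run_aligned_bounds(list_index, n_chunks):
    """Return cut positions [0, ..., n] that never fall inside a run of equal values."""
    n = len(list_index)
    if n == 0:
        return [0]
    # positions where a new run of equal values starts (position 0 always is one)
    run_starts = [0] + [k for k in range(1, n) if list_index[k] != list_index[k - 1]]
    bounds = [0]
    for c in range(1, n_chunks):
        target = c * n / n_chunks            # ideal, evenly spaced cut
        p = bisect_left(run_starts, target)  # first run start at or after the target
        if p < len(run_starts) and run_starts[p] > bounds[-1]:
            bounds.append(run_starts[p])
    bounds.append(n)
    return bounds


def split_by_bounds(list_index, matrix, bounds):
    """Slice both inputs once per chunk; matrix only needs to support row slicing."""
    return [[list_index[a:b], matrix[a:b]] for a, b in zip(bounds[:-1], bounds[1:])]
```

On the toy input, `run_aligned_bounds([1,1,2,3,3,4,4,5], 2)` gives `[0, 5, 8]`, so the two chunks hold `[1,1,2,3,3]` and `[4,4,5]`. With the real data, `matrix` would be the CSR matrix (for example `matrix_user.tocsr()`), and the run starts could also be computed with NumPy instead of a list comprehension. Chunks can end up fewer than `n_chunks` and of uneven size when runs are long.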

I need my program to run in less than 2 hours, as it has to be called several times a day. Is there a Python function that can cut my two lists according to the pattern in the first sublist?

Thank you for your help!
