Thursday, December 1, 2022

Python extract values from different substrings

I have a DataFrame named df with a column named "text". Each row of that column holds one string like this example:

d20s 22 i2as¶001VNINDEX455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0441 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶8564 $q9$uAgency XYZ¶7146 $q1$uAgency ABC$fHTML$

From this string I want to extract the values of specific zones and subfields, for example zone 7146 with subfield $u, or zone 0441 with subfield $c.

The result should look like this:

7146$u          0441$c
wwww.abc.org    2609-2565
Agency XYZ      2609-2565
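
To make the structure of one record concrete, here is a minimal sketch of how it breaks down: zones are separated by "¶", and inside a zone each subfield starts with "$" followed by a one-character code. The helper name split_subfields and the shortened sample string are assumptions made for illustration; they are not part of the code further down.

```python
def split_subfields(zone):
    """Return (code, value) pairs for one zone such as '7146 $q1$uwwww.abc.org$ma1'."""
    # Everything before the first "$" is the zone prefix, so it is skipped.
    parts = zone.split("$")
    # Each remaining non-empty part starts with a one-character subfield code.
    return [(p[0], p[1:]) for p in parts[1:] if p]


# Shortened sample (assumption: only two zones kept from the example string).
sample = "¶0441 $a2609-2565$c2609-2565¶7146 $q1$uwwww.abc.org$ma1"

for zone in sample.split("¶"):
    if zone:  # skip the empty piece before the first "¶"
        prefix = zone.split("$", 1)[0].strip()
        print(prefix, split_subfields(zone))
```

Seen this way, "7146$u" means: in every zone whose prefix is 7146, take the subfield whose code is u.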

Here is the code I wrote:

import pandas as pd


df = pd.read_csv('dataset.csv')

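def extract(text, start_pattern, sc):
    """Return the first value of subfield `sc` found at or after the first `start_pattern`."""
    # Locate the first occurrence of the zone, e.g. "¶7146".
    ist = text.find(start_pattern)
    if ist < 0:
        return ""
    # Locate the first subfield marker, e.g. "$u", from that position on.
    ist = text.find(sc, ist)
    if ist < 0:
        return ""
    # The value ends at the next "$" or "¶", whichever comes first.
    im = text.find("$", ist + len(sc))
    iz = text.find("¶", ist + len(sc))
    if im >= 0:
        if iz >= 0:
            ie = min(im, iz)
        else:
            ie = im
    else:
        ie = iz
    # No closing delimiter found after the value: return nothing.
    if ie < 0:
        return ""
    return text[ist + len(sc): ie]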

def extract_text(row, list_in_zones):
    """Return one extracted value per entry of `list_in_zones` for a DataFrame row."""
    text = row["text"]
    if pd.isna(text):
        return [""] * len(list_in_zones)
    # "7146$u" becomes the pair ("¶7146", "$u").
    patterns = [("¶" + p, "$" + c) for p, c in [zone.split("$") for zone in list_in_zones]]
    return [extract(text, pattern, sc) for pattern, sc in patterns]


list_in_zones = ["7146$u", "0441$c", "200$y"]


df[list_in_zones] = df.apply(lambda row: extract_text(row, list_in_zones),
                             axis=1,
                             result_type="expand")

df.to_excel("extract.xlsx", index=False)


I want the code to be dynamic, so that later, if I want to extract another zone from the string, I only need to add it to list_in_zones = [.....]. For example, to extract zone 200 with subfield $y, I add 200$y to list_in_zones = [.....], and I should then get every 200$y value in the string (including repeated occurrences, if any).

In my code, if a zone is repeated, for example 7146$u, I only get the first value, "wwww.abc.org"; I cannot extract the value from the repeated zone (if any), "Agency XYZ". This shows that my code does not handle zone/$... pairs that occur more than once.
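
One possible direction, given only as a minimal sketch: split the string on "¶", keep the zones whose prefix matches, and collect every matching subfield instead of stopping at the first one. The function name extract_all and the shortened sample are assumptions for illustration. The sketch also assumes that matching a zone by prefix is enough (a spec such as "200" would also match a zone starting with "2001").

```python
def extract_all(text, zone_spec):
    """Return every value of a 'zone$code' spec, e.g. '7146$u', in order of appearance."""
    zone, code = zone_spec.split("$")
    values = []
    for segment in text.split("¶"):
        if not segment.startswith(zone):
            continue
        # Look only inside this zone, so a missing subfield cannot
        # pick up a value from a later zone.
        for part in segment.split("$")[1:]:
            if part.startswith(code):
                values.append(part[len(code):].strip())
    return values


# Shortened version of the example string (assumption: unrelated zones dropped).
sample = (
    "d20s 22 i2as¶0441 $a2609-2565$c2609-2565¶0441 $a2758-8965$c4578-7854"
    "¶7146 $q1$uwwww.abc.org$ma1¶8564 $q9$uAgency XYZ¶7146 $q1$uAgency ABC$fHTML$"
)

print(extract_all(sample, "7146$u"))  # ['wwww.abc.org', 'Agency ABC']
print(extract_all(sample, "0441$c"))  # ['2609-2565', '4578-7854']
print(extract_all(sample, "200$y"))   # []
```

On this sample the second 7146$u value is "Agency ABC", because "Agency XYZ" sits in zone 8564. To keep one row per record, the returned list could be joined into a single cell (for example with "; ".join(...)), or kept as a list and expanded into rows with DataFrame.explode.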

Can you help me please?
