Thursday, February 25, 2021

Python regular expression findall() not returning desired list

I am trying to extract URLs from a large string using regular expressions.

This is my regular expression:

url_pattern = r'(https?)*[-a-zA-Z0-9@:%._\+~#=\/\/]*\.[a-zA-Z0-9()]*\b([-a-zA-Z0-9()@:%_\+~#?&//=]*)'

This also matches undesired strings such as 2.5, D.C, etc. How can I avoid this and match only URL patterns, which may or may not contain 'http' or 'www.'?

Also, when I use urls = re.findall(url_pattern, text) to get a list of matched substrings, it returns this:

text = 'The petition should be updated. The education committee has increased the spending from the Mayor\'s proposal -- a 2.38% increase over last year.  https://www.washingtonpost.com/local/education/despite-increases-in-funding-some-dc-schools-face-possible-cuts/2017/05/18/9e132516-3bdf-11e7-a058-ddbb23c75d82_story.html?utm_term=.5c053ce98ec2  On Thursday, the D.C. Council’s Committee on Education voted to add more money, raising the per-pupil spending increase to 2.38 percent. That measure now moves to the full council for a decision.'
Output:
[('', '%'), ('https', '?utm_term='), ('', ''), ('', ''), ('', '')]
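The tuples above are consistent with how re.findall() is documented to behave: when a pattern contains more than one capturing group, it returns a list of tuples of the groups rather than the full matches. The following is a minimal sketch of that behavior using a deliberately simplified pattern and sample string (both are illustrative assumptions, not the pattern from this question):

```python
import re

# Hypothetical, simplified pattern and text, used only to illustrate the behavior.
sample = 'see https://example.com now'

# Two capturing groups: findall() returns one tuple of groups per match.
with_groups = re.findall(r'(https?)://([a-z.]+)', sample)

# Same pattern with non-capturing groups (?:...): findall() returns whole matches.
without_groups = re.findall(r'(?:https?)://(?:[a-z.]+)', sample)

# Alternatively, keep the groups and read the whole match from each match object.
whole_matches = [m.group(0) for m in re.finditer(r'(https?)://([a-z.]+)', sample)]

print(with_groups)     # [('https', 'example.com')]
print(without_groups)  # ['https://example.com']
print(whole_matches)   # ['https://example.com']
```

So the empty strings and fragments like '%' in the output are the contents of the two groups for each match, not the matched substrings themselves.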

How can I get it to return the desired list of URLs?
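One possible direction, shown only as a sketch: require at least one dot-separated label followed by an alphabetic top-level part of two or more letters, and use only non-capturing groups so that re.findall() returns whole matches. The name stricter_pattern and the shortened sample text are assumptions made for illustration. This is not a complete URL grammar: it would still pick up the domain part of an e-mail address and any punctuation glued to the end of a path.

```python
import re

# Hypothetical stricter pattern (an assumption, not a full URL specification):
#   optional scheme, one or more "label." parts, a TLD of 2+ letters, optional path.
stricter_pattern = r'(?:https?://)?(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?:/\S*)?'

# Shortened, made-up sample containing the same kinds of false positives.
sample = ('Up 2.38% in D.C. See https://www.example.com/a/b.html?x=.5c '
          'and www.example.org today.')

urls = re.findall(stricter_pattern, sample)
print(urls)  # ['https://www.example.com/a/b.html?x=.5c', 'www.example.org']
```

Here '2.38' is rejected because the part after the dot must be letters, and 'D.C' is rejected because the final part must be at least two letters long.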

Edit: here is a regexr link where the match occurs, although undesired strings like '2.8', 'D.C', etc. are also matched: regexr link
