I am trying to extract URLs from a large string using regular expressions.
This is my regular expression:
url_pattern = r'(https?)*[-a-zA-Z0-9@:%._\+~#=\/\/]*\.[a-zA-Z0-9()]*\b([-a-zA-Z0-9()@:%_\+~#?&//=]*)'
This also matches undesired strings such as 2.5 and D.C. How can I avoid that and match only URL patterns, which may or may not contain 'http' or 'www.'?
Also, when I use urls = re.findall(url_pattern, text)
to get a list of matched substrings for the text below, it returns a list of tuples rather than the matched strings:
text = 'The petition should be updated. The education committee has increased the spending from the Mayor\'s proposal -- a 2.38% increase over last year. https://www.washingtonpost.com/local/education/despite-increases-in-funding-some-dc-schools-face-possible-cuts/2017/05/18/9e132516-3bdf-11e7-a058-ddbb23c75d82_story.html?utm_term=.5c053ce98ec2 On Thursday, the D.C. Council’s Committee on Education voted to add more money, raising the per-pupil spending increase to 2.38 percent. That measure now moves to the full council for a decision.'
Output:
[('', '%'), ('https', '?utm_term='), ('', ''), ('', ''), ('', '')]
How can I get it to return the desired list of URLs?
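For context on the tuples in the output: when a pattern contains capturing groups, re.findall returns the group contents for each match instead of the whole matched text, which is why every entry above is a pair. A minimal sketch of the difference, using a deliberately simplified pattern and a made-up sample string (neither is from the question):

```python
import re

demo = "see http://example.com now"  # hypothetical sample string

# Capturing groups: findall returns one tuple of group contents per match.
with_groups = re.findall(r"(https?)://(\S+)", demo)
print(with_groups)  # [('http', 'example.com')]

# Non-capturing groups (?:...): findall returns the whole matched text.
without_groups = re.findall(r"(?:https?)://(?:\S+)", demo)
print(without_groups)  # ['http://example.com']

# Alternatively, keep the groups and read the full match from finditer.
whole_matches = [m.group(0) for m in re.finditer(r"(https?)://(\S+)", demo)]
print(whole_matches)  # ['http://example.com']
```

Either approach should apply to url_pattern as well: turn its two groups into (?:...) or switch to re.finditer and take group(0).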
Edit: regexr link where the match occurs, although undesired strings like '2.8', 'D.C', etc. are also included: regexr link
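One possible way to avoid matches like 2.38 and D.C is to require the last dot-separated part to be alphabetic and at least two characters long, while keeping the scheme optional. The sketch below is only an illustration under that assumption; URL_RE and extract_urls are hypothetical names, and the pattern is a replacement for url_pattern, not a repair of it. The sample is an excerpt of the text from the question.

```python
import re

# Hypothetical stricter pattern (an assumption, not the original url_pattern).
URL_RE = re.compile(
    r"""
    (?:https?://)?          # optional scheme
    (?:[a-zA-Z0-9-]+\.)+    # one or more labels, each followed by a dot
    [a-zA-Z]{2,}\b          # last part: letters only, at least two of them
    (?:/\S*)?               # optional path, query string, and fragment
    """,
    re.VERBOSE,
)


def extract_urls(text):
    """Return whole-match URL candidates with trailing punctuation trimmed."""
    return [m.group(0).rstrip(".,;:!?") for m in URL_RE.finditer(text)]


sample = (
    "a 2.38% increase over last year. "
    "https://www.washingtonpost.com/local/education/"
    "despite-increases-in-funding-some-dc-schools-face-possible-cuts/"
    "2017/05/18/9e132516-3bdf-11e7-a058-ddbb23c75d82_story.html"
    "?utm_term=.5c053ce98ec2 "
    "On Thursday, the D.C. Council’s Committee on Education voted to add more money"
)

for url in extract_urls(sample):
    print(url)
```

On this excerpt only the washingtonpost.com link is printed; 2.38 is rejected because the last part is numeric, and D.C because it has a single letter. The pattern is still a heuristic: a bare file name such as story.html would also match on its own, so a list of known top-level domains may be needed for stricter filtering.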