Issue
I have:
<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
I want to extract the URL, but I don't know how to do that without using regex. Is it even possible?
So far, my solution with regex is:
url = re.findall(r"\('(.*?)'\)", soup['style'])[0]
Solution
You could try using the cssutils package. Something like this should work:
import cssutils
from bs4 import BeautifulSoup
html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
div_style = soup.find('div')['style']
style = cssutils.parseStyle(div_style)
url = style['background-image']
>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '') # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
Although you will ultimately still need to parse out the actual URL, this method should be more resilient to changes in the HTML. If you really dislike string manipulation and regex, you can pull the URL out in this roundabout way:
sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
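If you keep the string cleanup from the first approach, note that the chained replace calls would also remove any ")" that is part of the URL itself. Below is a minimal sketch of a stricter helper that uses only the standard library and no regex. The name strip_css_url is hypothetical (it is not part of cssutils or BeautifulSoup), and it assumes the value is a single url(...) token, such as the one cssutils returned above.

```python
def strip_css_url(value):
    """Return the URL inside a single CSS url(...) token, or None if there is none."""
    text = value.strip()
    # CSS function names are case-insensitive, so compare in lowercase.
    if not (text.lower().startswith('url(') and text.endswith(')')):
        return None
    # Cut off the leading "url(" and the trailing ")" only.
    inner = text[4:-1].strip()
    # Drop one pair of matching quotes, if present.
    if len(inner) >= 2 and inner[0] == inner[-1] and inner[0] in ('"', "'"):
        inner = inner[1:-1]
    return inner


print(strip_css_url("url('/uploads/images/players/16113-1399107741.jpeg')"))
print(strip_css_url("url(/uploads/images/players/16113-1399107741.jpeg)"))
```

Both calls print /uploads/images/players/16113-1399107741.jpeg. Because only the outer wrapper is sliced off, a path that itself contains parentheses should come through unchanged.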
Answered By - mhawke