Issue
I'm stuck extracting the text between <h1> and </h1>.
Please help me.
My code is:
import bs4
import re
import urllib2

url2 = 'http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Top%20Brands_All#jumpTo=0|20'
htmlf = urllib2.urlopen(url2)
soup = bs4.BeautifulSoup(htmlf)
#res=soup.findAll('div',attrs={'class':'product-unit'})
for res in soup.findAll('a', attrs={'class': 'fk-display-block'}):
    suburl = 'http://www.flipkart.com/' + res.get('href')
    subhtml = urllib2.urlopen(suburl)
    subhtml = subhtml.read()
    subhtml = re.sub(r'\s\s+', '', subhtml)
    subsoup = bs4.BeautifulSoup(subhtml)
    res2 = subsoup.find('h1', attrs={'itemprop': 'name'})
    if res2:
        print res2
The output:
<h1 itemprop="name">Moto G</h1>
<h1 itemprop="name">Moto E</h1>
<h1 itemprop="name">Moto E</h1>
But I want this:
Moto G
Moto E
Moto E
Solution
Calling get_text() on any HTML tag returns the text contained in that tag. So you just need to call get_text() on res2, i.e.:
    if res2:
        print res2.get_text()
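To see the difference between printing the tag and printing its text without hitting the network, here is a minimal sketch. It assumes Python 3 and the built-in "html.parser" backend rather than the Python 2 setup in the question, and the inline HTML string is a made-up stand-in for a fetched product page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one downloaded product page.
html = '<html><body><h1 itemprop="name">Moto G</h1></body></html>'

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1", attrs={"itemprop": "name"})

if heading:
    # Printing the Tag object renders the full markup.
    print(heading)             # <h1 itemprop="name">Moto G</h1>
    # get_text() returns only the text inside the tag.
    print(heading.get_text())  # Moto G
```

The same find() call as in the question is used; only the last step changes.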
PS: As a side note, I think the line subhtml = re.sub(r'\s\s+','',subhtml) in your code is an expensive operation, since it runs over the whole page. If all you are doing is getting rid of the extra whitespace around the text, you could do that with:
    if res2:
        print res2.get_text().strip()
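As a rough illustration of that side note, the sketch below compares the two approaches on a small heading padded with whitespace. It assumes Python 3 and "html.parser"; the HTML string is invented for the example, and get_text(strip=True) is shown only as an alternative spelling that BeautifulSoup also offers.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical heading with the kind of padding real pages often contain.
html = '<h1 itemprop="name">\n    Moto G\n  </h1>'

# Approach from the question: collapse whitespace runs in the whole document first.
collapsed = re.sub(r'\s\s+', '', html)

# Approach from the answer: parse as-is and clean only the extracted text.
heading = BeautifulSoup(html, "html.parser").find("h1", attrs={"itemprop": "name"})
raw = heading.get_text()                   # still has the surrounding whitespace
stripped = raw.strip()                     # leading/trailing whitespace removed
built_in = heading.get_text(strip=True)    # Beautifulsoup's own strip option

print(repr(raw))
print(stripped)
print(built_in)
```

Note that strip() only trims the ends of the string, so it works on the short extracted text instead of scanning the entire page.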
Answered By - shaktimaan