Issue
I am posting a new question about a problem that arose after I resolved the one from Part 1 of this question.
My code below is supposed to retrieve the link, title, price, and location from web pages. It works perfectly for the example below, but on certain pages it extracts only a few records or none at all, and I am not sure why.
I am also including an example where the code retrieves few or no records.
Example (working)
from bs4 import BeautifulSoup
html = '''
<li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL
MILLAS&quot;. REALMENTE COMO NUEVO">
<a href="link1">
<div class="title">BELLO HONDA ACCORD "95 MIL MILLAS". REALMENTE COMO NUEVO</div>
<div class="details">
<div class="price">$4,600</div>
<div class="location">
Miami
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Honda Element">
<a href=" link2 ">
<div class="title">Honda Element</div>
<div class="details">
<div class="price">$4,950</div>
<div class="location">
Coral springs
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Mint Jeep">
<a href=" link3 ">
<div class="title">Mint Jeep</div>
<div class="details">
<div class="price">$8,500</div>
<div class="location">
Pompano
</div>
</div>
</a>
</li>
'''
data = []
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for e in soup.select('li[title]'):
    data.append({
        'link': e.a.get('href'),
        'title': e.get('title'),
        'price': e.select_one('.price').get_text(),
        'location': e.select_one('.location').get_text(strip=True)
    })
data
Here is a non-working example: my code retrieves only the first row and skips the rest.
<li class="cl-static-search-result" title="Joyner 650">
<a href="LINK1">
<div class="title">Joyner 650</div>
<div class="details">
<div class="price">$4,100</div>
<div class="location">
Fernley
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="2009 Subaru legacy AWD!">
<a href="LINK2">
<div class="title">2009 Subaru legacy AWD!</div>
<div class="details">
<div class="price">$6,300</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="2011 international durastar 4300 box truck">
<a href="LINK3">
<div class="title">2011 international durastar 4300 box truck</div>
<div class="details">
<div class="price">$21,900</div>
<div class="location">
Lodi
</div>
</div>
</a>
</li>
Solution
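Some of the li elements don't have a location. For those, select_one('.location') returns None, and calling get_text() on None raises an AttributeError. You need to check for a missing element before reading its text.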
for e in soup.select('li[title]'):
    price = e.select_one('.price')
    location = e.select_one('.location')
    data.append({
        'link': e.a.get('href'),
        'title': e.get('title'),
        'price': price.get_text() if price else '$unknown',
        'location': location.get_text(strip=True) if location else 'unknown'
    })
Result with the first html:
[{'link': 'link1',
  'location': 'Miami',
  'price': '$4,600',
  'title': 'BELLO HONDA ACCORD "95 MIL \n MILLAS". REALMENTE COMO NUEVO'},
 {'link': ' link2 ',
  'location': 'Coral springs',
  'price': '$4,950',
  'title': 'Honda Element'},
 {'link': ' link3 ',
  'location': 'Pompano',
  'price': '$8,500',
  'title': 'Mint Jeep'}]
Result with the second html:
[{'link': 'LINK1',
  'location': 'Fernley',
  'price': '$4,100',
  'title': 'Joyner 650'},
 {'link': 'LINK2',
  'location': 'unknown',
  'price': '$6,300',
  'title': '2009 Subaru legacy AWD!'},
 {'link': 'LINK3',
  'location': 'Lodi',
  'price': '$21,900',
  'title': '2011 international durastar 4300 box truck'}]
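The same None check can also be factored into a small helper, so each optional field is handled in one place. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name text_or_default is hypothetical, the HTML is a trimmed copy of the second example, and the parser is set to html.parser. Unlike the loop above, it also strips whitespace from the price and tolerates a listing with no <a> tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the second example: the last listing has no .location div.
html = '''
<li class="cl-static-search-result" title="Joyner 650">
  <a href="LINK1">
    <div class="title">Joyner 650</div>
    <div class="details">
      <div class="price">$4,100</div>
      <div class="location">Fernley</div>
    </div>
  </a>
</li>
<li class="cl-static-search-result" title="2009 Subaru legacy AWD!">
  <a href="LINK2">
    <div class="title">2009 Subaru legacy AWD!</div>
    <div class="details">
      <div class="price">$6,300</div>
    </div>
  </a>
</li>
'''


def text_or_default(parent, selector, default):
    """Hypothetical helper: stripped text of the first match, or `default`."""
    node = parent.select_one(selector)  # None when nothing matches
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('li[title]'):
    link = e.find('a')  # may be None if a listing has no anchor
    data.append({
        'link': link.get('href') if link is not None else None,
        'title': e.get('title'),
        'price': text_or_default(e, '.price', '$unknown'),
        'location': text_or_default(e, '.location', 'unknown'),
    })

print(data)
```

With this trimmed input, the second listing gets 'unknown' as its location instead of raising an AttributeError.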
Answered By - Barmar