Issue
I have the below webpage source:
</li>
<li class="cl-static-search-result" title="BELLO HONDA ACCORD "95 MIL
MILLAS". REALMENTE COMO NUEVO">
<a href="link1">
<div class="title">BELLO HONDA ACCORD "95 MIL MILLAS". REALMENTE COMO NUEVO</div>
<div class="details">
<div class="price">$4,600</div>
<div class="location">
Miami
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Honda Element">
<a href=" link2 ">
<div class="title">Honda Element</div>
<div class="details">
<div class="price">$4,950</div>
<div class="location">
Coral springs
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Mint Jeep">
<a href=" link3 ">
<div class="title">Mint Jeep</div>
<div class="details">
<div class="price">$8,500</div>
<div class="location">
Pompano
</div>
</div>
</a>
</li>
I need to extract the data as below:
| URL | TITLE | PRICE |
| ---- | ------------------- | ------ |
| link1 | BELLO HONDA ACCORD | $4,600 |
| link2 | Honda Element | $4,950 |
| link3 | Mint Jeep | $8,500 |
I am able to extract the URLs. When I try to get the title and price as well, my inner loop returns the titles/prices for the whole page after each link instead of only the ones belonging to that link. Below is my code:
from urllib import request
from bs4 import BeautifulSoup
from lxml import etree
import csv

page_url = 'URLNAME'
rawpage = request.urlopen(page_url)
soup = BeautifulSoup(rawpage, 'html5lib')
links_list = []
for link in soup.find_all('a'):
    try:
        url = link.get('href')
        for div in soup.find_all('div', attrs={'class':'title'}):
            title = div.text
            print (title)
        links_list.append({'url': url})
    # if the row is missing anything...
    except AttributeError:
        # ...skip it, don't blow up.
        pass
# save it to csv
with open('links.csv', 'w', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    # Create the header row
    csv_writer.writerow(['url', 'title'])
    for row in links_list:
        csv_writer.writerow([str(row['url'])])
Solution
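Try changing how you select and iterate over the elements: loop over each result <li> and read the link, title and price from that element, rather than searching the whole soup again inside the loop. CSS selectors make this concise: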
...
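data = []
# 'html.parser' is Python's built-in parser; 'lxml' or 'html5lib' also work if installed
soup = BeautifulSoup(html, 'html.parser')
# every result is an <li> with a title attribute, so look everything up relative to it
for e in soup.select('li[title]'):
    data.append({
        'link': e.a.get('href').strip(),
        'title': e.get('title'),
        'price': e.select_one('.price').get_text(strip=True)
    })
data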
Then process the list of dicts to write your file, create a DataFrame, and so on.
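As a rough sketch of the "write your file" step, the list of dicts can be passed to `csv.DictWriter` from the standard library. The helper name `write_rows`, the output file name, and the two sample rows (copied from the question's listing) are only assumptions for illustration; in practice you would pass in the `data` list built above.

```python
import csv

# Sample rows in the same shape as `data` above (values taken from the question)
data = [
    {'link': 'link2', 'title': 'Honda Element', 'price': '$4,950'},
    {'link': 'link3', 'title': 'Mint Jeep', 'price': '$8,500'},
]

def write_rows(rows, path='links.csv'):
    """Write a list of dicts to a CSV file, with a header row first."""
    with open(path, 'w', newline='', encoding='utf-8') as csv_out:
        writer = csv.DictWriter(csv_out, fieldnames=['link', 'title', 'price'])
        writer.writeheader()    # header: link,title,price
        writer.writerows(rows)  # one CSV line per dict

write_rows(data)
```

Prices such as `$4,950` contain a comma, so the `csv` module quotes them automatically. If you already use pandas, `pandas.DataFrame(data).to_csv('links.csv', index=False)` should give a similar file.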
Example
from bs4 import BeautifulSoup
html = '''
<li class="cl-static-search-result" title="BELLO HONDA ACCORD "95 MIL
MILLAS". REALMENTE COMO NUEVO">
<a href="link1">
<div class="title">BELLO HONDA ACCORD "95 MIL MILLAS". REALMENTE COMO NUEVO</div>
<div class="details">
<div class="price">$4,600</div>
<div class="location">
Miami
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Honda Element">
<a href=" link2 ">
<div class="title">Honda Element</div>
<div class="details">
<div class="price">$4,950</div>
<div class="location">
Coral springs
</div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Mint Jeep">
<a href=" link3 ">
<div class="title">Mint Jeep</div>
<div class="details">
<div class="price">$8,500</div>
<div class="location">
Pompano
</div>
</div>
</a>
</li>
'''
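data = []
# 'html.parser' is Python's built-in parser; 'lxml' or 'html5lib' also work if installed
soup = BeautifulSoup(html, 'html.parser')
# every result is an <li> with a title attribute, so look everything up relative to it
for e in soup.select('li[title]'):
    data.append({
        'link': e.a.get('href').strip(),
        'title': e.get('title'),
        'price': e.select_one('.price').get_text(strip=True)
    })
data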
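The code in the question wrapped its loop in try/except to skip incomplete rows. With the per-item approach the same concern applies, because `select_one()` returns `None` when nothing matches and calling `.get_text()` on `None` raises `AttributeError`. Below is a minimal sketch of one way to guard against that. The function name `extract_rows` and the shortened sample, whose second item deliberately has no price, are made up for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed sample; the second item deliberately has no price
html = '''
<li class="cl-static-search-result" title="Honda Element">
  <a href=" link2 ">
    <div class="title">Honda Element</div>
    <div class="details"><div class="price">$4,950</div></div>
  </a>
</li>
<li class="cl-static-search-result" title="Mint Jeep">
  <a href=" link3 ">
    <div class="title">Mint Jeep</div>
  </a>
</li>
'''

def extract_rows(markup):
    """Return one dict per <li title=...>, using None for missing parts."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for li in soup.select('li[title]'):
        # search only inside the current <li>, never the whole soup
        link = li.select_one('a[href]')
        price = li.select_one('.price')
        rows.append({
            'link': link['href'].strip() if link else None,
            'title': li.get('title'),
            'price': price.get_text(strip=True) if price else None,
        })
    return rows

rows = extract_rows(html)
```

Here the item without a price ends up with `'price': None` instead of stopping the loop; whether to keep or drop such rows depends on what you need in the CSV.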
Answered By - HedgeHog