Friday, November 10, 2023

[FIXED] How to retrieve nested data with BeautifulSoup?


Issue

I have the below webpage source:

</li>
    <li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL 
      MILLAS&quot;. REALMENTE COMO NUEVO">
        <a href="link1">
            <div class="title">BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO</div>
            <div class="details">
                <div class="price">$4,600</div>
                <div class="location">
                    Miami
                </div>
            </div>
        </a>
    </li>
    <li class="cl-static-search-result" title="Honda Element">
        <a href=" link2 ">
            <div class="title">Honda Element</div>

            <div class="details">
                <div class="price">$4,950</div>
                <div class="location">
                    Coral springs
                </div>
            </div>
        </a>
    </li>
    <li class="cl-static-search-result" title="Mint Jeep">
        <a href=" link3 ">
            <div class="title">Mint Jeep</div>

            <div class="details">
                <div class="price">$8,500</div>
                <div class="location">
                    Pompano
                </div>
            </div>
        </a>
    </li>

I need to extract the data as below:

| URL  | TITLE               | PRICE  |
| ---- | ------------------- | ------ |
| link1 | BELLO HONDA ACCORD | $4,600 |
| link2 | Honda Element      | $4,950 |
| link3 | Mint Jeep          | $8,500 |

I am able to extract the URLs. When I try to get the title and price as well, my inner loop returns the titles/prices for the entire page after each URL instead of only the ones that belong to that link. Below is my code:

from urllib import request
from bs4 import BeautifulSoup
from lxml import etree
import csv

page_url = 'URLNAME'
rawpage = request.urlopen(page_url)

soup = BeautifulSoup(rawpage, 'html5lib')

links_list = []

for link in soup.find_all('a'):
    try:
        url = link.get('href')

        for div in soup.find_all('div', attrs={'class': 'title'}):
            title = div.text
            print(title)

        links_list.append({'url': url})
    # if the row is missing anything...
    except AttributeError:
        # ...skip it, don't blow up.
        pass

# save it to csv
with open('links.csv', 'w', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    # Create the header row
    csv_writer.writerow(['url', 'title'])

    for row in links_list:
        csv_writer.writerow([str(row['url'])])

Solution

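The inner loop calls soup.find_all(), which searches the whole page on every pass, so each link is followed by every title on the page. Instead, iterate over the li result elements and read the link, title, and price from within each one, for example with CSS selectors: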

...
data = []
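# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('li[title]'):
    data.append({
        # some href values are padded with spaces, e.g. " link2 "
        'link': e.a.get('href').strip(),
        'title': e.get('title'),
        'price': e.select_one('.price').get_text()
    })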
data
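
To illustrate why the loop in the question printed every title for each link, here is a minimal sketch comparing a search started from the soup object with one started from a single tag. The two-item HTML string is a trimmed stand-in for the page in the question, and the variable names (first_link, page_wide, scoped) are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page in the question (two results only).
html = '''
<li title="Honda Element"><a href="link2"><div class="title">Honda Element</div></a></li>
<li title="Mint Jeep"><a href="link3"><div class="title">Mint Jeep</div></a></li>
'''
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')

# Searching from the soup object scans the whole document.
page_wide = [d.get_text() for d in soup.find_all('div', class_='title')]

# Searching from a tag only scans that tag's descendants.
scoped = [d.get_text() for d in first_link.find_all('div', class_='title')]

print(page_wide)  # ['Honda Element', 'Mint Jeep']
print(scoped)     # ['Honda Element']
```

Calling find_all() or select() on an element rather than on soup is what keeps each title and price tied to its own link.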

Then process the resulting list of dicts, for example to write your CSV file or to create a DataFrame.
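
One possible way to do the CSV step is the standard library's csv.DictWriter. This is a sketch, not part of the original answer: write_csv is a hypothetical helper name, the sample rows reuse values from the question, and an in-memory buffer stands in for the file so the snippet has no side effects.

```python
import csv
import io

# Sample rows in the same shape as the list of dicts built by the loop.
data = [
    {'link': 'link2', 'title': 'Honda Element', 'price': '$4,950'},
    {'link': 'link3', 'title': 'Mint Jeep', 'price': '$8,500'},
]


def write_csv(rows, out):
    """Write the rows to an open text file object, header first."""
    writer = csv.DictWriter(out, fieldnames=['link', 'title', 'price'])
    writer.writeheader()
    writer.writerows(rows)


# io.StringIO stands in for a real file here.
buffer = io.StringIO()
write_csv(data, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a handle opened with open('links.csv', 'w', newline='') instead of the buffer. Prices such as $4,950 contain a comma, so the writer quotes them automatically.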

Example

from bs4 import BeautifulSoup
html = '''
<li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL 
      MILLAS&quot;. REALMENTE COMO NUEVO">
        <a href="link1">
            <div class="title">BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO</div>
            <div class="details">
                <div class="price">$4,600</div>
                <div class="location">
                    Miami
                </div>
            </div>
        </a>
    </li>
    <li class="cl-static-search-result" title="Honda Element">
        <a href=" link2 ">
            <div class="title">Honda Element</div>

            <div class="details">
                <div class="price">$4,950</div>
                <div class="location">
                    Coral springs
                </div>
            </div>
        </a>
    </li>
    <li class="cl-static-search-result" title="Mint Jeep">
        <a href=" link3 ">
            <div class="title">Mint Jeep</div>

            <div class="details">
                <div class="price">$8,500</div>
                <div class="location">
                    Pompano
                </div>
            </div>
        </a>
    </li>
'''
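data = []
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('li[title]'):
    data.append({
        # some href values are padded with spaces, e.g. " link2 "
        'link': e.a.get('href').strip(),
        'title': e.get('title'),
        'price': e.select_one('.price').get_text()
    })
print(data)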
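
The question's code tried to skip incomplete rows, and the first title attribute contains a line break. A more defensive variant might look like the sketch below. It rests on assumptions: parse_results is a hypothetical helper name, and the second li (link9, with no price) is invented here purely to exercise the missing-element path.

```python
from bs4 import BeautifulSoup

html = '''
<li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL
      MILLAS&quot;. REALMENTE COMO NUEVO">
    <a href="link1"><div class="price">$4,600</div></a>
</li>
<li class="cl-static-search-result" title="No price listed">
    <a href=" link9 "></a>
</li>
'''


def parse_results(markup):
    """Return one dict per li[title], using None for missing parts."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for li in soup.select('li[title]'):
        link = li.find('a')
        price = li.select_one('.price')
        rows.append({
            'link': link.get('href', '').strip() if link else None,
            # split() + join() collapses line breaks and repeated spaces
            'title': ' '.join(li['title'].split()),
            'price': price.get_text(strip=True) if price else None,
        })
    return rows


rows = parse_results(html)
for row in rows:
    print(row)
```

Checking for None before calling get_text() avoids the AttributeError that the try/except in the question was guarding against, while still keeping the row.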

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
