Monday, November 8, 2021

[FIXED] I am trying to scrape a website for links and also scrape the links inside the already scraped links

Tags: beautifulsoup, python, python-requests-html

Issue

I am trying to scrape a website for links. After scraping, I also want to check whether each scraped link is just an article or contains more links, and if it does, scrape those links as well. I am using BeautifulSoup 4, and this is the code I have so far:

import requests
from bs4 import BeautifulSoup
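# Assumed placeholder value: the original snippet never defines user_agent
user_agent = 'Mozilla/5.0'
url = 'https://www.lbbusinessjournal.com/'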
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)

I want the links on the page https://www.lbbusinessjournal.com/, and I also want to scrape each of those links for further links. For example, one of the links is https://www.lbbusinessjournal.com/news/, and I want the links inside https://www.lbbusinessjournal.com/news/ as well. So far, I am getting links from the main page only.

Solution

Re-raise the exception from your except clause (raise e) and the traceback will show that the error

AttributeError: 'NoneType' object has no attribute 'get'

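arises from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the h3 elements you retrieve does not contain an a element; in fact, it looks like the link is commented out in the HTML. Because the exception is raised inside the try block, the loops stop at that point and the remaining links are never printed.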
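The failure can be reproduced without touching the network. The following is a minimal sketch, assuming a made-up HTML fragment (the example.com URLs and the markup are hypothetical, not copied from the site) in which the second heading's link sits inside an HTML comment:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading's link is inside an HTML comment,
# so the parser sees a comment node there rather than an <a> tag.
html = """
<h3 class="entry-title td-module-title"><a href="https://example.com/a">A</a></h3>
<h3 class="entry-title td-module-title"><!-- <a href="https://example.com/b">B</a> --></h3>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h3", class_="entry-title td-module-title")

first_anchor = headings[0].find("a")   # a Tag object
second_anchor = headings[1].find("a")  # None, because no <a> tag exists here

error_message = None
try:
    # Same call chain as in the question: .get() is called on None.
    second_anchor.get("href")
except AttributeError as e:
    error_message = str(e)

print(first_anchor.get("href"))
print(second_anchor)
print(error_message)
```

The first heading yields a usable link, while the second reproduces the AttributeError from the question.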

Instead, split the post1.find('a').get('href') call into two steps and check that the element returned by post1.find('a') is not None before reading its 'href' attribute, i.e.:

for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)

Output from running your code with this change:

https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
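Since the same None check is needed for both the outer and the inner loop, it may help to move the link extraction into one function. The following is a rough sketch, not the answer's own code: extract_links, crawl_two_levels and the fetch callable are hypothetical names introduced here, the pages dictionary is invented test data, and skipping already-seen links is an added assumption rather than something the question asks for.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

TITLE_CLASS = "entry-title td-module-title"  # class string from the question


def extract_links(html, base_url, names, classes):
    """Return absolute hrefs of matching elements, skipping ones with no <a>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for post in soup.find_all(names, class_=classes):
        anchor = post.find("a")
        if anchor is None:  # e.g. the link is commented out in the HTML
            continue
        href = anchor.get("href")
        if href:
            links.append(urljoin(base_url, href))
    return links


def crawl_two_levels(start_url, fetch):
    """Collect links from start_url and from each page it links to.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    seen = []
    first_level = extract_links(
        fetch(start_url), start_url, ["h3", "li"], [TITLE_CLASS, "menu-item"]
    )
    for link in first_level:
        if link in seen:
            continue
        seen.append(link)
        for link1 in extract_links(fetch(link), link, ["h3"], [TITLE_CLASS]):
            if link1 not in seen:
                seen.append(link1)
    return seen


# Invented pages standing in for real HTTP responses.
pages = {
    "https://example.com/": (
        '<ul><li class="menu-item"><a href="/news/">News</a></li></ul>'
        '<h3 class="entry-title td-module-title">'
        '<a href="https://example.com/post-1/">P1</a></h3>'
    ),
    "https://example.com/news/": (
        '<h3 class="entry-title td-module-title"><a href="/post-2/">P2</a></h3>'
        '<h3 class="entry-title td-module-title"><!-- <a href="/x/">x</a> --></h3>'
    ),
    "https://example.com/post-1/": "<p>article</p>",
}

result = crawl_two_levels("https://example.com/", pages.get)
for url in result:
    print(url)
```

To run this against the real site, fetch could be a small wrapper around requests, for instance lambda url: requests.get(url, headers={'User-Agent': user_agent}, timeout=10).text. Passing the fetcher in as an argument keeps the parsing logic testable offline.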

Answered By - dspencer

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
