Issue
I am trying to scrape a website for links. After scraping, I also want to check whether each link I scraped is just an article or contains more links, and if it does, scrape those links as well. I am trying to implement this using BeautifulSoup 4, and this is the code I have so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.lbbusinessjournal.com/'
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)
I want the links on the page https://www.lbbusinessjournal.com/, and I also want to scrape any links inside the pages those links point to. For example, for https://www.lbbusinessjournal.com/news/, I want the links inside that page as well. So far, I am only getting the links from the main page.
Solution
Try adding raise e to your except clause and you will see that the error AttributeError: 'NoneType' object has no attribute 'get' arises from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This is because at least one of the HTML h3 elements you retrieve does not contain an a element. In fact, it looks like the link is commented out in the HTML.
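As a small illustration of that behaviour, the sketch below parses a made-up HTML fragment (the example.com URLs and the markup are assumptions, not the real page source) in which the second h3 has its link commented out. BeautifulSoup treats the comment as a Comment node rather than a tag, so find('a') has nothing to return:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading's link sits inside an HTML comment.
html = """
<h3 class="entry-title td-module-title"><a href="https://example.com/a">A</a></h3>
<h3 class="entry-title td-module-title"><!-- <a href="https://example.com/b">B</a> --></h3>
"""

soup = BeautifulSoup(html, 'html.parser')
headings = soup.find_all('h3', class_='entry-title td-module-title')

# find('a') returns a Tag for the first heading and None for the second,
# because the commented-out markup is not parsed into an <a> element.
results = [h3.find('a') for h3 in headings]
for element in results:
    print(element)
```

Calling .get('href') on the second result is what raises the AttributeError.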
Instead, you should split the post1.find('a').get('href') call into two steps and check that the element returned by post1.find('a') is not None before trying to get the 'href' attribute, i.e.:
for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
Output from running your code with this change:
https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
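If you want to reuse the None check at every level instead of nesting loops, one possible way to organise it is sketched below. The names extract_links and crawl are hypothetical helpers, not part of BeautifulSoup or requests, and fetch is assumed to be any callable that takes a URL and returns the page's HTML as a string. For brevity the sketch only uses the h3 selector from the code above; the li / menu-item selector could be added in the same way.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the href of each article heading link, skipping headings with no <a>."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for post in soup.find_all('h3', class_='entry-title td-module-title'):
        element = post.find('a')
        if element is None:
            continue  # e.g. the link is commented out in the HTML
        href = element.get('href')
        if href:
            # Resolve relative hrefs against the page they were found on.
            links.append(urljoin(base_url, href))
    return links


def crawl(start_url, fetch, max_depth=1):
    """Collect links from start_url and from the pages it links to.

    fetch is an assumed callable: fetch(url) -> HTML string.
    max_depth=1 means: the start page plus one level of linked pages.
    """
    seen = {start_url}
    ordered = []
    frontier = [start_url]
    for _ in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            for link in extract_links(fetch(url), url):
                if link not in seen:
                    seen.add(link)
                    ordered.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return ordered
```

With requests, fetch could be something like lambda u: requests.get(u, headers={'User-Agent': user_agent}).text, where user_agent is whatever string you already pass in your own code. The seen set keeps the same page from being requested twice when pages link to each other.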
Answered By - dspencer