Issue
I'm trying to scrape thumbnail images from a website, but when I ask BeautifulSoup to find a specific div class, it returns None. I tried this before on a similar website and managed to get everything within the desired div class, but I'm running into issues here. If you have the time, I would be extremely grateful for your advice.
Below is a sample of my code:
from pickle import NONE, TRUE
import requests
from bs4 import BeautifulSoup
import requests.exceptions

localfile = "C:/Users/XXX/Desktop/TapTap Apps/TapTap Page 1"
url = "https://www.taptap.cn/app/"
username = 'XXXX'
password = 'XXXX'
proxy = f"https://{username}:{password}@someproxysite.com"

def webScraper(url, proxy, min, max):
    for x in range(min, max):
        page = requests.get(url + str(x), proxy, timeout=10) # Request url and iterate with x
        soup = BeautifulSoup(page.content, 'lxml')
        image = soup.find('div', class_="tap-image-wrapper app-info-board__img") # Finds the HTML elements that holds the image
        print (image)

webScraper(url, proxy, 12332, 13000)
Solution
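For starters, the proxy configuration is wrong. The second positional argument of requests.get is params, so the proxy string is sent as query parameters instead of being used as a proxy. Pass it as a dictionary through the proxies keyword instead.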
proxies = {"https": proxy}
page = requests.get(url + str(x), proxies=proxies, timeout=10)
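If the proxy credentials contain characters such as "@" or ":", the proxy URL itself can break before requests ever uses it. The following is a minimal sketch of one way to build the dictionary, assuming a hypothetical helper named build_proxies and the placeholder host from the question; it is not part of the original answer.

```python
from urllib.parse import quote


def build_proxies(username, password, host, scheme="https"):
    """Return a proxies mapping in the shape requests expects.

    build_proxies is a hypothetical helper; the host and scheme are
    placeholders taken from the question, not a real proxy service.
    """
    # Percent-encode the credentials so characters such as "@" or ":"
    # do not break the structure of the proxy URL.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"{scheme}://{user}:{pwd}@{host}"
    # Keys name the scheme of the target URL; values are the proxy to use.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("XXXX", "p@ss:word", "someproxysite.com")
print(proxies["https"])
```

The resulting dictionary can then be passed as proxies=proxies, exactly as in the line above.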
You can also specify a User-Agent header in the request. Any browser's string can be tried; for Chrome, for example, it is:
user_agent_det = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
headers = {'User-Agent':user_agent_det}
page = requests.get(url + str(x), headers=headers, proxies=proxies, timeout=10)
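When many pages are requested in a loop, the headers and proxies can be set once on a requests.Session rather than repeated on every call. This is a sketch under that assumption; make_session is a hypothetical helper name, and the proxy URL is the placeholder from the question.

```python
import requests

# Same Chrome User-Agent string as used in the answer.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/104.0.5112.79 Safari/537.36"
)


def make_session(proxy_url, user_agent=USER_AGENT):
    """Create a Session that sends the header and proxy on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.proxies.update({"https": proxy_url})
    return session


# Placeholder credentials and host, as in the question.
session = make_session("https://XXXX:XXXX@someproxysite.com")
# A request would then look like this (not executed here):
# page = session.get("https://www.taptap.cn/app/12332", timeout=10)
```

A session also reuses the underlying connection between requests, which tends to help when iterating over hundreds of page ids.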
Putting it together to get the image:
def webScraper(url, proxy, min, max):
    proxies = {"https": proxy}
    user_agent_det = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
    headers = {'User-Agent': user_agent_det}
    for x in range(min, max):
        try:
            # Send the request through the proxy with a browser-like User-Agent
            page = requests.get(url + str(x), headers=headers, proxies=proxies, timeout=10)
            if page.status_code == 200:
                soup = BeautifulSoup(page.content, 'lxml')
                image = soup.find('div', class_="tap-image-wrapper app-info-board__img")
                print(image)
            else:
                # Report the full page URL that failed, not just the base URL
                print(f"Failed to access {url + str(x)} : {page.status_code}")
        except Exception as e:
            print(f"Failed : {e}")
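Since find returns None when nothing matches, it helps to guard against that before reading the image URL out of the div. The sketch below assumes the wrapper contains an img tag with a src attribute (or data-src if the page lazy-loads images); the sample HTML is made up for illustration and has not been checked against the real page, and thumbnail_url is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def thumbnail_url(html):
    """Return the thumbnail URL from the wrapper div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches both classes regardless of their order.
    wrapper = soup.select_one("div.tap-image-wrapper.app-info-board__img")
    if wrapper is None:
        return None
    img = wrapper.find("img")
    if img is None:
        return None
    # data-src is an assumption for lazy-loaded images.
    return img.get("src") or img.get("data-src")


# Made-up markup, only to exercise the helper.
sample = (
    '<div class="tap-image-wrapper app-info-board__img">'
    '<img src="https://example.com/thumb.png"></div>'
)
print(thumbnail_url(sample))
print(thumbnail_url("<html><body></body></html>"))
```

If this still returns None for a page that loads correctly in a browser, printing page.text and searching it for the class name shows whether the div is present in the raw response at all.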
Answered By - Arunbh Yashaswi