Issue
I have a script whose goal is to get the number of comments from this URL: https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999. The script should print "190330 commentaires", but after a few lines one of the lookups returns a NoneType object. I am searching for each tag by its exact type and its class or id name.
Here is my script:
import re
import time

import requests  # missing from the original script, but needed for requests.get
from bs4 import BeautifulSoup

########################### SEARCH ##################
while True:
    sent = 0
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
    cookies = {"cookie_policy_agreement": "3"}
    url = 'https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390#comments'
    response = requests.get(url, headers=headers, cookies=cookies)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')  # otherwise html5lib
    # Whole comments
    comments = soup.find("div", id="comments")
    comments = comments.find("section", class_="bg--main overflow--hidden bRad--fromW4-a")
    comments = comments.find("div", class_="space--h-3 space--v-3")  # this is a None object?
    comments = comments.find("h2", class_="flex--inline boxAlign-ai--all-c")
    comments = comments.find("span", class_="size--all-m size--fromW3-l text--b overflow--wrap-off").text
    print(comments)
    time.sleep(30)
Solution
When a tag's find method returns None, it means that the tag has no descendant element that satisfies the provided criteria. In this case, the <section> element you found has no <div> inside it with the classes space--h-3 space--v-3. Looking at the page source at the link you provided, that is indeed the case: there is no such <div>.
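To see which step of a chain of find calls is the one that comes back empty, something like the following sketch may help. The markup in SAMPLE_HTML and the helper name find_chain are made up for illustration; they are not taken from the real page or from BeautifulSoup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page's structure differs.
SAMPLE_HTML = """
<div id="comments">
  <section class="bg--main">
    <h2><span>42 commentaires</span></h2>
  </section>
</div>
"""


def find_chain(root, steps):
    """Apply find() step by step; stop at the first step that returns None.

    Returns (node, None) on success, or (None, failed_step) on failure.
    """
    node = root
    for name, attrs in steps:
        found = node.find(name, attrs=attrs)
        if found is None:
            return None, (name, attrs)
        node = found
    return node, None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
steps = [
    ("div", {"id": "comments"}),
    ("section", {"class": "bg--main"}),
    ("div", {"class": "space--h-3 space--v-3"}),  # not present in SAMPLE_HTML
]
node, failed_step = find_chain(soup, steps)
if node is None:
    print("No match for:", failed_step)
```

Checking each result before calling find on it again turns the AttributeError on NoneType into a message that names the selector that did not match.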
Either way, it seems you are doing a lot of unnecessary find operations.
When an element on a page has an id attribute, that usually means there will not be another element with the same id. Since you are looking for the number of "commentaires", I would start with the closest parent element that has an id attribute.
In this case, the closest one seems to be a <div id="thread-comments" ...>. The line you are interested in also seems to be inside the only <h2> tag below that <div>, or at least inside the first one. Thus I would suggest the following optimization:
import re
...
soup = BeautifulSoup(html, 'html.parser')
comments_div = soup.find("div", id="thread-comments")
num_comments_line = comments_div.h2.get_text(strip=True)
# This is optional, if you actually want just the number itself:
match = re.search(r'^(\d+)\s+\w+', num_comments_line)
num_comments = int(match.group(1))
print(num_comments) # output: 189010
Note that these two are equivalent: (see docs)
comments_div.h2
comments_div.find("h2")
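As a small check of that equivalence, here is a sketch on made-up markup (SAMPLE_HTML is an assumption, not the real page). It also shows that both forms return only the first matching descendant, and None when there is none.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div id="thread-comments">
  <h2>first heading</h2>
  <h2>second heading</h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
comments_div = soup.find("div", id="thread-comments")

by_attribute = comments_div.h2
by_find = comments_div.find("h2")

# Both forms return the same Tag object: the first <h2> descendant.
print(by_attribute is by_find)
print(by_attribute.get_text(strip=True))

# When there is no such descendant, both forms return None.
print(comments_div.table is None, comments_div.find("table") is None)
```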
The last bit is just a regular expression to grab the number from a string that looks like 189010 commentaires.
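If the heading text ever changes shape, re.search returns None and match.group(1) raises an AttributeError. A minimal sketch of a guarded version, using the same pattern, could look like this; the function name parse_comment_count is hypothetical.

```python
import re

# Same pattern as in the answer: leading digits, whitespace, then a word.
COUNT_PATTERN = re.compile(r'^(\d+)\s+\w+')


def parse_comment_count(line):
    """Return the leading integer of a line like '189010 commentaires', or None."""
    match = COUNT_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1))


print(parse_comment_count("189010 commentaires"))
print(parse_comment_count("commentaires"))
```

Returning None instead of raising lets the polling loop skip a bad response and try again on the next iteration.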
Answered By - Daniil Fajnberg