Issue
I'm a complete newbie to parsing websites, but I've had a script that pulls figures from different housing sites, and it worked flawlessly for the past year. However, for a reason I can't figure out, it no longer works on daft.ie. I've tried to debug it, but nothing I try seems to work. I get either 'list index out of range' or 'None', which I know indicates the list is empty, but it clearly isn't. Below is a snippet of the problematic code.
I would appreciate it if someone with more knowledge than I have could take a look, as I'm sure it's going to be something obvious.
Appreciate all the assistance from the site.
import sys
import requests
from bs4 import BeautifulSoup


def get_buy_numbers_dublin_city():
    page = requests.get("https://www.daft.ie/property-for-sale/dublin-city")
    soup = BeautifulSoup(page.content, 'html.parser')
    prop_num = str(soup.find_all(class_="styles__SearchH1-sc-1t5gb6v-3 guZHZl")[0])
    prop_num = prop_num.replace('<h1 class="styles__SearchH1-sc-1t5gb6v-3 guZHZl" data-testid="search-h1">', '')
    prop_num = prop_num.replace(' Properties for Sale in Dublin City</h1>', '')
    prop_num = prop_num.replace(',', '')
    return prop_num


def main(argv):
    print(get_buy_numbers_dublin_city())


if __name__ == "__main__":
    main(sys.argv[1:])
Solution
One issue is that this site protects its content, so you should always take a closer look at the response text or the soup. In this case, none of the content you would expect is in the HTML, which is why find_all returns an empty list and indexing it with [0] raises 'list index out of range'.
You could add a user-agent header to avoid this behavior for some time, or use selenium and co. to mimic a browser. Be aware that if some other part of your scraping behavior is detected, the server may block you again.
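As a rough illustration of checking the response before parsing it, here is a minimal sketch. The helper name `describe_response` and the default marker `"<h1"` are hypothetical and not part of the original script; the check only looks at the status code and at whether the expected markup appears in the body at all.

```python
def describe_response(status_code, body, marker="<h1"):
    """Return a short diagnosis of a fetched page.

    Hypothetical helper: `marker` is an assumption about what the
    expected page contains (by default, any <h1> tag).
    """
    if status_code != 200:
        return f"unexpected status {status_code}"
    if marker not in body:
        # The server answered, but not with the page a browser would see.
        return "status 200, but expected content is missing"
    return "ok"
```

With requests, this could be called as `describe_response(page.status_code, page.text)`. Any result other than "ok" suggests the selectors are not the problem and the request itself needs attention first.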
Example
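import requests
from bs4 import BeautifulSoup

# Send a user-agent header so the request is less likely to be rejected
page = requests.get("https://www.daft.ie/property-for-sale/dublin-city", headers={'user-agent': 'some-agent'})
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(page.content, 'html.parser')
# The first word of the <h1> text is the property count
print(soup.h1.text.split()[0])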
Will give you:
2,544
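If the figure is needed as a number rather than as text, the heading text can be converted afterwards. This is only a sketch: the function name `parse_property_count` is hypothetical, and it assumes the heading starts with the count, as in "2,544 Properties for Sale in Dublin City" from the question's script.

```python
import re


def parse_property_count(heading_text):
    """Extract the leading count from heading text as an int.

    Assumes text like '2,544 Properties for Sale in Dublin City'.
    Returns None when the text does not start with a number.
    """
    match = re.match(r"\s*(\d[\d,]*)", heading_text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group(1).replace(",", ""))
```

Returning None instead of raising lets the caller decide how to handle a page whose heading has changed or is missing the count.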
Answered By - HedgeHog