Issue
I am practising scraping with Beautiful Soup. I want to scrape all the results when searching for Data Analyst jobs at Daijob. There are 70 results divided into 7 pages of 10 results each.
import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

for page in range(20):
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    if r.status_code != 200:
        break
    else:
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        print('\033[1m' + 'Web 1, page {0}'.format(page + 1) + '\033[0m')
The idea was that the page number would keep increasing, and when it reached 8 the loop would stop.
This has worked on other websites, where the status_code changed to 410 instead of 200 once the loop reached a page number with no data.
But in this case, no matter what page number you request (it can even be 100000), the response keeps coming back with a status_code of 200, so I can't make the loop stop even when there is no more useful data to scrape.
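For example, a quick way to see this (same URL as above; the page number is far past the 7 real result pages):

import requests

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'
# Request a page number well beyond the last page of results.
r = requests.get(website, params={"page": 100000})
print(r.status_code)  # still prints 200, even though there are no listings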
Is there a more efficient way to stop that loop automatically?
Solution
When no jobs are found, the website shows this message: No jobs were found that matched your search.
You can use this to find out whether the page contains any jobs. Here is the full code:
import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

page = 0
while True:
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    # Stop as soon as the "no results" message appears in the response.
    if 'No jobs were found that matched your search.' in r.text:
        break
    else:
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        print('\033[1m' + 'Web 1, page {0}'.format(page + 1) + '\033[0m')
        page += 1
Output:
Web 1, page 1
Web 1, page 2
Web 1, page 3
Web 1, page 4
Web 1, page 5
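One note on this design: matching the literal "no results" message works, but it would break silently if the site ever reworded that text. A slightly more robust variant is to count the job links on each page and stop when none are found. The sketch below assumes a hypothetical selector ('a[href*="/jobs/detail"]') for Daijob's job links; inspect the actual HTML and adjust it before relying on this:

import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

page = 0
while True:
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    soup = BeautifulSoup(r.content, "lxml")
    # Hypothetical selector for job detail links; replace it with whatever
    # element actually wraps each listing on the real page.
    jobs = soup.select('a[href*="/jobs/detail"]')
    if not jobs:
        break  # no listings found on this page, so we are past the last page
    print('\033[1m' + 'Web 1, page {0}: {1} jobs'.format(page + 1, len(jobs)) + '\033[0m')
    page += 1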
Answered By - Sushil