Issue
I am practising scraping with Beautiful Soup. I want to scrape all the results when searching for Data Analyst jobs at Daijob. There are 70 results divided into 7 pages of 10 results each.
import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

for page in range(20):
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    if r.status_code != 200:
        break
    else:
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        print('\033[1m' + 'Web 1, page {0}'.format(page + 1) + '\033[0m')
The idea was that the page number would keep increasing, and when it reached 8 the loop would stop.
This has worked on other websites, where the status_code changed to 410 instead of 200 once the loop reached a page number with no data.
But in this case, no matter what page number you request (it can even be 100000), the response keeps coming back with a status_code of 200, so I can't make the loop stop even when there is no more useful data to scrape.
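For example, a quick way to see this (same URL as above; the page number is far past the 7 real result pages):

import requests

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'
# Request a page number well beyond the last page of results.
r = requests.get(website, params={"page": 100000})
print(r.status_code)  # still prints 200, even though there are no listings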
Is there a more efficient way to stop that loop automatically?
Solution
When no jobs are found, the website shows this message: No jobs were found that matched your search.
You can use this to find out whether the page contains any jobs. Here is the full code:
import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

page = 0
while True:
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    # Stop as soon as the "no results" message appears in the response.
    if 'No jobs were found that matched your search.' in r.text:
        break
    else:
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        print('\033[1m' + 'Web 1, page {0}'.format(page + 1) + '\033[0m')
        page += 1
Output:
Web 1, page 1
Web 1, page 2
Web 1, page 3
Web 1, page 4
Web 1, page 5
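One note on this design: matching the literal "no results" message works, but it would break silently if the site ever reworded that text. A slightly more robust variant is to count the job links on each page and stop when none are found. The sketch below assumes a hypothetical selector ('a[href*="/jobs/detail"]') for Daijob's job links; inspect the actual HTML and adjust it before relying on this:

import time

import requests
from bs4 import BeautifulSoup

website = 'https://www.daijob.com/en/jobs/search_result?job_search_form_hidden=1&keywords=Data+Analyst'

page = 0
while True:
    time.sleep(1)
    r = requests.get(website, params={"page": page + 1})
    soup = BeautifulSoup(r.content, "lxml")
    # Hypothetical selector for job detail links; replace it with whatever
    # element actually wraps each listing on the real page.
    jobs = soup.select('a[href*="/jobs/detail"]')
    if not jobs:
        break  # no listings found on this page, so we are past the last page
    print('\033[1m' + 'Web 1, page {0}: {1} jobs'.format(page + 1, len(jobs)) + '\033[0m')
    page += 1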
Answered By - Sushil