Friday, December 3, 2021

[FIXED] Scrapy: response.body returning gibberish HTML ( ~ 95% of the time). Trying to diagnose

Labels: proxy, python, python-3.x, scrapy, web-scraping

Issue

Problem Summary

I am attempting to load this URL (https://www.glassdoor.com/Reviews/reviews-SRCH_IP2.htm) via a yield scrapy.Request(url = url, callback = ...) call, but the response.body property returns HTML that does not resemble the HTML I expect.

An excerpt from the body that is returned in response.body:

<!DOCTYPE html>\n
<html lang=\'en\' xmlns:fb=\'http://www.facebook.com/2008/fbml\' xmlns:og=\'http://opengraph.org/schema/\'\n      class=\'flex\'>
   \n\n<head prefix=\'og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# glassdoor: http://ogp.me/ns/fb/glassdoor#\'><script src="https://browser.sentry-cdn.com/5.2.0/bundle.min.js" crossorigin="anonymous"></script><script>\n\t\n\tSentry.init(\n\t\t{\n\t\t\tdsn: \'https://[email protected]/8\',\n\t\t\tenvironment: \'prod\',\n\t\t\tsampleRate: 0.0\n\t\t}\n\t);\n\tSentry.configureScope(function(scope){\n\t\tscope.setUser(\n\t\t\t{\n\t\t\t\tid: \'0\',\n\t\t\t\tguid: \'0b8f8e55-d91d-4ea7-848a-0a3a1b215fc8\'\n\t\t\t}\n\t\t);\n\t});\n</script><!-- because the getter clears the value --><script>\n\twindow.gdGlobals = window.gdGlobals ||\n\t\t[{\n\t\t\t\'analyticsId\':

The full body of the above HTML also does not contain any of the body content I am trying to scrape.

An excerpt from the HTML body when personally visiting the URL:

<!DOCTYPE html>
<html lang='en' xmlns:fb='http://www.facebook.com/2008/fbml' xmlns:og='http://opengraph.org/schema/' class='flex'>
   <head prefix='og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# glassdoor: http://ogp.me/ns/fb/glassdoor#'>
      <script src="https://browser.sentry-cdn.com/5.2.0/bundle.min.js" crossorigin="anonymous"></script><script>
         Sentry.init(
            {
                dsn: 'https://[email protected]/8',
                environment: 'prod',
                sampleRate: 0.0
            }
         );
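
Note that the \n, \t and \' sequences in the first excerpt are what Python shows when a bytes object such as response.body is printed or inspected without decoding, so the two excerpts may be closer than they look. A minimal sketch, assuming the first excerpt was copied from the repr of a bytes value; raw_body is a made-up, shortened stand-in, not the real response:

```python
# Made-up, shortened stand-in for response.body (bytes), not the real response.
raw_body = b"<!DOCTYPE html>\n<html lang='en'\n      class='flex'>\n<script src=\"bundle.min.js\"></script>"

# repr() of bytes keeps everything on one line: newlines show up as \n
# and single quotes as \' (because the data also contains double quotes).
as_repr = repr(raw_body)

# Decoding turns the bytes into ordinary text with real line breaks.
as_text = raw_body.decode("utf-8")

print(as_repr)
print(as_text)
```

In Scrapy, response.text gives the decoded string for text responses, which is easier to compare with what the browser shows.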

What I Have Tried

I have another spider (let's call it spider1) that calls scrapy.Request() successfully and gets the expected HTML back. The main difference between spider1 and this spider is that spider1 requires a login to access the information. I have tried requesting the URL above both before and after logging in, but the returned HTML is the same. Additionally, Glassdoor does not require a user login to read the contents of the URL I have linked to above, so I do not believe that is what is causing the issue.

My Code and What Is Weird

The code to call this is below:

start_urls = ["https://www.glassdoor.com/Reviews/reviews-SRCH_IP2.htm"]
yield scrapy.Request(url=self.start_urls[0], callback=self.process_page)

The weird part is that, when debugging, the HTML actually returns properly (but very infrequently – I'd estimate maybe 1/20 times). This occurs without any code changes, and I am having significant difficulty in determining what causes this to work in those rare instances.

My Thoughts

The only suspicion I have that seems at all valid is that I need to use a proxy for the spider. Glassdoor could be intentionally blocking my requests, which would explain why the HTML only returns correctly while debugging – again, this happens in roughly 1 of 20 run-throughs, and it has never returned correctly without breakpoints leading up to the scrapy.Request() call.

Thank you very much for any advice and/or pointers. It is greatly appreciated!

Solution

The page is rendered with JavaScript and XHR, so you need something that can execute them. One option is scrapy-selenium:

sudo pip3 install scrapy-selenium

Download the driver that matches your browser and operating system: geckodriver from https://github.com/mozilla/geckodriver/releases if you are using Firefox, or, for another browser, the corresponding driver listed at https://www.seleniumhq.org/download/

spider.py

import scrapy
from scrapy_selenium import SeleniumRequest


class Spider(scrapy.Spider):
    name = "spider"

    start_urls = ["https://www.glassdoor.com/Reviews/reviews-SRCH_IP2.htm"]

    def start_requests(self):
        # SeleniumRequest loads the page in the configured browser,
        # so the callback receives the rendered HTML.
        yield SeleniumRequest(url=self.start_urls[0], callback=self.parse_result)

    def parse_result(self, response):
        # Elements with class "url" nested inside a <span>.
        for url in response.selector.css('span .url'):
            print(url)
        # Elements with class "tightAll".
        for title in response.selector.css('.tightAll'):
            print(title)

settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'spider'

SPIDER_MODULES = ['spider.spiders']
NEWSPIDER_MODULE = 'spider.spiders'


SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/geckodriver'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # use '--headless' with Chrome instead of Firefox

ROBOTSTXT_OBEY = True



DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
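
The hard-coded /usr/local/bin/geckodriver path only works if the driver was installed there. One alternative is to resolve the path with shutil.which from the standard library. This is a sketch of that variant for settings.py, assuming geckodriver is somewhere on PATH:

```python
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
# which() returns the full path of the geckodriver executable if it is on
# PATH, or None if it cannot be found.
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
```

If which() returns None, the driver is not on PATH and the explicit path from the settings above is still needed.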

Outputs:

..
<Selector xpath="descendant-or-self::span/descendant-or-self::*/*[@class and contains(concat(' ', normalize-space(@class), ' '), ' url ')]" data='<span class="url">www.pwc.com</span>'>
<Selector xpath="descendant-or-self::span/descendant-or-self::*/*[@class and contains(concat(' ', normalize-space(@class), ' '), ' url ')]" data='<span class="url">www.primark.com</span>'>
<Selector xpath="descendant-or-self::span/descendant-or-self::*/*[@class and contains(concat(' ', normalize-space(@class), ' '), ' url ')]" data='<span class="url">www.ey.com</span>'>
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' tightAll ')]" data='<a href="/Overview/Working-at-Tesco-E...'>
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' tightAll ')]" data='<a href="/Overview/Working-at-J-Sains...'>
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' tightAll ')]" data='<a href="/Overview/Working-at-NESTA-E...'>
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' tightAll ')]" data='<a href="/Overview/Working-at-McDonal...'>
...
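
A quick way to tell whether a response came back without the JavaScript-rendered content is to look for the class names the spider selects on (url and tightAll). The following is a minimal standard-library sketch, not part of the original answer; ClassCollector, missing_classes and the two sample HTML strings are made up for illustration:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that do not occur in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return sorted(set(expected) - collector.classes)


# Made-up samples: one with the listing markup, one with only a script stub.
rendered = '<span><span class="url">www.pwc.com</span></span><a class="tightAll" href="/Overview/x">x</a>'
unrendered = "<html><head><script>window.gdGlobals = [];</script></head><body></body></html>"

print(missing_classes(rendered, ["url", "tightAll"]))    # []
print(missing_classes(unrendered, ["url", "tightAll"]))  # ['tightAll', 'url']
```

In a spider callback, a non-empty result from missing_classes(response.text, ["url", "tightAll"]) would suggest that the downloaded page lacks the rendered listings.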

For further reading see https://docs.scrapy.org/en/latest/topics/dynamic-content.html

Answered By - Dan-Dev

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
