Thursday, November 11, 2021

[FIXED] Issue with scraping href in Python using Scrapy Spider

Tags: python, python-3.x, scrapy, web-scraping

Issue

I am trying to scrape the href from the title on a Craigslist page. I am using Python with Scrapy and have been having trouble with it.

I have tried several things, but I don't understand what is wrong.

import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = {'https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date'}

    def parse(self,response):
        sel = Selector(response)
        for href in sel.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract_first():
            print(href)

There aren't any error messages; I just get zero results.

Solution

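I fixed your code a bit to dump the hrefs. I removed the Selector wrapper (response.xpath can be called directly), replaced extract_first with extract, and made start_urls a list. extract_first returns only the first match as a single string rather than a list of matches, so it is not what you want to loop over: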

import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        # extract() returns every matching href as a list of strings
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            print('HREF:', href)

Output:

HREF: https://chicago.craigslist.org/chc/cto/d/chicago-2010-honda-cr-lx/6960935447.html
HREF: https://chicago.craigslist.org/chc/ctd/d/midlothian-2010-honda-cr-ex-4wd-5-speed/6960826946.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2014-honda-cr-crv-lx-sport/6960791760.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2016-honda-cr-crv-lx-sport/6960737848.html
HREF: https://chicago.craigslist.org/nch/cto/d/wilmette-honda-crv-2007/6960699975.html
HREF: https://chicago.craigslist.org/chc/ctd/d/westmont-2014-honda-cr-ex-skuel-suv/6960650987.html
...
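
To see why the choice between extract and extract_first matters in a for loop, here is a minimal sketch that needs no network access. The two helper functions are plain-Python stand-ins written for this illustration (they are not Scrapy's API), and the hrefs are made up; they only mimic the shapes of the return values, a list of strings versus a single string or None.

```python
# Plain-Python stand-ins (hypothetical helpers, not Scrapy's API) that mimic
# the return shapes of extract() and extract_first().
def extract(matches):
    """Return every match as a list of strings."""
    return list(matches)


def extract_first(matches):
    """Return only the first match as a single string, or None if empty."""
    return matches[0] if matches else None


# Hypothetical hrefs standing in for the XPath matches.
matches = ["/a.html", "/b.html"]

# Looping over a list visits one href per iteration.
for href in extract(matches):
    print("HREF:", href)

# Looping over a single string visits one character per iteration.
chars = [ch for ch in extract_first(matches)]
print(chars[:3])
```

The second loop walks through the characters of the first href instead of through the hrefs themselves, which is why the loop in the question could not print the list of links.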

Update - dumping the results to a JSON file:

import scrapy

class HrefItem(scrapy.Item):
    # one field per value to export
    href = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            # print('HREF:', href)
            # yield an item instead of printing so Scrapy can export it
            item = HrefItem()
            item['href'] = href
            yield item

Corresponding docs are here.
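
Yielding items does not write a file by itself; Scrapy still needs to be told where to export them. One possible way, sketched here as an assumption rather than as part of the original answer, is the FEEDS setting in the project's settings.py (available in Scrapy 2.1 and later). The file name hrefs.json is a hypothetical choice.

```python
# settings.py fragment (sketch): FEEDS maps an output path to feed options.
# "hrefs.json" is a hypothetical file name; FEEDS requires Scrapy 2.1 or later.
FEEDS = {
    "hrefs.json": {
        "format": "json",    # serialize the yielded items as a JSON array
        "encoding": "utf8",
    },
}
```

For a single run, passing -o hrefs.json to scrapy crawl HondaUrl should have a similar effect without touching the settings.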

Answered By - LVK

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
