Issue
I am trying to scrape the href from each listing title on a Craigslist page. I am using Python with Scrapy and have been having trouble with it.
I have tried several things, but I don't understand what is wrong.
import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = {'https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date'}

    def parse(self,response):
        sel = Selector(response)
        for href in sel.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract_first():
            print(href)
There aren't any error messages; I just get zero results.
Solution
I fixed your code a bit to dump the hrefs (removed Selector and replaced extract_first with extract):
import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        # response.xpath() works directly; extract() returns every matching href as a list
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            print('HREF:', href)
Output:
HREF: https://chicago.craigslist.org/chc/cto/d/chicago-2010-honda-cr-lx/6960935447.html
HREF: https://chicago.craigslist.org/chc/ctd/d/midlothian-2010-honda-cr-ex-4wd-5-speed/6960826946.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2014-honda-cr-crv-lx-sport/6960791760.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2016-honda-cr-crv-lx-sport/6960737848.html
HREF: https://chicago.craigslist.org/nch/cto/d/wilmette-honda-crv-2007/6960699975.html
HREF: https://chicago.craigslist.org/chc/ctd/d/westmont-2014-honda-cr-ex-skuel-suv/6960650987.html
...
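One reason the original loop could not list the links: extract_first() returns a single string (or None when nothing matches), so a for loop over it walks that string character by character, while extract() returns a list with one string per match. The following is a plain-Python sketch of that difference, with no Scrapy needed; the two variables are stand-ins for the return values and the URLs are made up for illustration.

```python
# Stand-ins for what the two selector methods hand back (made-up URLs).
first_only = "https://example.org/a.html"        # like extract_first(): one string
all_matches = [
    "https://example.org/a.html",
    "https://example.org/b.html",
]                                                # like extract(): a list of strings

# Looping over a string yields single characters, not URLs.
chars = [piece for piece in first_only]

# Looping over a list yields one full URL per iteration.
urls = [piece for piece in all_matches]

print(chars[:5])   # ['h', 't', 't', 'p', 's']
print(urls)
```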
Update: dumping the results to a JSON file:
import scrapy

class HrefItem(scrapy.Item):
    href = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            # print('HREF:', href)
            item = HrefItem()
            item['href'] = href
            # Yielded items are passed to Scrapy's feed exporter
            yield item
Corresponding docs are here.
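To actually get the file, the spider can be run with an output option, for example scrapy runspider honda_spider.py -o hrefs.json (both file names are assumptions, not from the original post). The JSON feed from a single run is one array of objects, each with an href key. Below is a minimal sketch for reading it back; load_hrefs is a hypothetical helper, and the demo uses a hand-written stand-in for the feed with made-up URLs.

```python
import json
import os
import tempfile


def load_hrefs(path):
    """Return the href values from a JSON feed file (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)  # the feed is a single JSON array of objects
    return [record["href"] for record in records if "href" in record]


# Demo: a hand-written stand-in for the exported feed (made-up URLs).
sample = (
    '[\n'
    '{"href": "https://example.org/a.html"},\n'
    '{"href": "https://example.org/b.html"}\n'
    ']'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    feed_path = os.path.join(tmp_dir, "hrefs.json")
    with open(feed_path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    hrefs = load_hrefs(feed_path)

print(hrefs)
```

With a real run, load_hrefs("hrefs.json") would be called on the exported file instead of the temporary one.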
Answered By - LVK