Issue
I am trying to write a simple scraping script to scrape Google Summer of Code orgs that use the tech I require. It's a work in progress. My parse function works fine, but whenever it calls back into the parse_org function, I don't get any output.
# -*- coding: utf-8 -*-
import scrapy


class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    allowed_domains = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']

    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback = self.parse_org)

    def parse_org(self,response):
        tech=response.css('li.organization__tag organization__tag--technology::text').extract()
        #if 'python' in tech:
        yield
        {
        'name':response.css('title::text').extract_first()
        #'ideas_list':response.css('')
        }
Solution
First of all, you are configuring allowed_domains incorrectly. As the documentation specifies:
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled.
Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to the list.
As you can see, you need to include only the domain names, not full URLs. This is a filtering feature (so that other domains don't get crawled). It is also optional, so I would actually recommend not including it.
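As a minimal sketch of the point above, the bare domain can be pulled out of the start URL with the standard library. The helper name domain_of is hypothetical and is not part of Scrapy; it only illustrates what kind of string allowed_domains expects.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return only the host part of a URL (hypothetical helper, not a Scrapy API)."""
    return urlparse(url).netloc


start_url = 'https://summerofcode.withgoogle.com/archive/2018/organizations/'

# allowed_domains expects bare domain names, not full URLs with scheme and path.
allowed_domains = [domain_of(start_url)]
print(allowed_domains)  # ['summerofcode.withgoogle.com']
```

In the spider itself you would normally just write the domain string by hand, or leave allowed_domains out.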
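Also, your CSS selector for getting tech is incorrect. A space in a selector means "descendant", so the original looks for an element named organization__tag--technology inside the li. To match a single li that carries both classes, chain the classes with a dot. It should be: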
li.organization__tag.organization__tag--technology
Answered By - eLRuLL