Issue
I'm using Python 3.11 and Scrapy 2.7.1 on Windows 10. Following the Scrapy example that downloads files from nirsoft.net, I made some adjustments to crawl another website (https://www.midi-karaoke.info); please take a look.
I'm not sure, but I expect to fetch most of the HTML pages (more than 100,000) with my modified script, yet no .mid files.
The site itself behaves strangely. It has a very flat design with more than 100,000 numbered page names. If I browse down to a MIDI file link to download it, nothing happens. But if I inspect the source in the browser and click the .mid link there, I get the file; the same works if I change the page extension to .mid in the browser's address bar (https://www.midi-karaoke.info/21110cbd.html -> https://www.midi-karaoke.info/21110cbd.mid).
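Expressed in Python, that rename is just string manipulation; here is a minimal sketch with the URL from above, only to illustrate what I mean:
# derive the .mid download URL from the page URL, as described above
page_url = "https://www.midi-karaoke.info/21110cbd.html"
mid_url = page_url.rsplit(".html", 1)[0] + ".mid"
print(mid_url)  # -> https://www.midi-karaoke.info/21110cbd.mid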
Furthermore, changes made to my script sometimes work and sometimes don't. On the next pass, the next day, the same script may stop working again. Here is what I use:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from webcrawler.items import WebcrawlerItem  # import C:\..\scrapy\webcrawler\webcrawler\items.py


class WebcrawlSpider(CrawlSpider):
    name = 'webcrawl'
    allowed_domains = ['www.midi-karaoke.info']
    start_urls = ['https://www.midi-karaoke.info']
    # avoid redirects?
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302, 301]

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
        # the website lives only under '/'
        Rule(LinkExtractor(allow=(r'/'), deny_extensions=[], restrict_xpaths=('//a[@href]')), callback="parse_items", follow=True),
        # extract 'href' links
        Rule(LinkExtractor(allow=(r'/'), restrict_xpaths=('//a[@class="MIDI"]',)), callback="parse_items", follow=True),
        # the href links we care about are in class='MIDI'
    )

    def parse_item(self, response):
        file_url = response.css('.downloadline::attr(href)').getall()  # get all pages found
        file_url = response.urljoin(file_url)
        file_extension = file_url.split('.')[-1]
        # filter links by file extension (optional)
        if file_extension not in ('mid', 'html', 'zip'):
            return
        # if '.ru.' in file_url or '.en.' in file_url:
        #     return
        item = WebcrawlerItem()
        item['file_urls'] = [file_url]
        item['original_file_name'] = file_url.split('/')[-1]
        yield item
Sometimes this works and sometimes it does not. Please help.
settings.py:
# Scrapy settings for webcrawler project
BOT_NAME = 'webcrawler'
SPIDER_MODULES = ['webcrawler.spiders']
NEWSPIDER_MODULE = 'webcrawler.spiders'
DUPEFILTER_DEBUG = False
REDIRECT_ENABLED = False
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'webcrawler.pipelines.WebcrawlerPipeline': 1,
}
FILES_STORE = r"C:\Users\wiwa53\scrapy\webcrawler\downloads"
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
items.py:
# Define here the models for your scraped items
import scrapy


class WebcrawlerItem(scrapy.Item):
    file_urls = scrapy.Field()
    original_file_name = scrapy.Field()
    files = scrapy.Field
pipelines.py:
# Define your item pipelines here
from scrapy.pipelines.files import FilesPipeline


class WebcrawlerPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        file_name: str = request.url.split("/")[-1]
        # print(file_name)
        return file_name
Solution
There are a few noticeable issues in your code. For example, you have multiple rules that will match the same URLs, and two of them point to a non-existent callback method parse_items. Your WebcrawlerItem also declares files = scrapy.Field without the call parentheses, so the files field that the FilesPipeline writes its results to is never actually defined. I also don't see any attempt to extract the artist/title information you would need to organize the downloads.
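As an aside, if you did want to stay with CrawlSpider, the three overlapping rules could be collapsed into a single rule pointing at the callback that actually exists; this is only a minimal sketch, to be placed inside your spider class in place of the current rules (LinkExtractor and Rule are already imported in your file):
    # Sketch: one rule that follows every internal link and sends each
    # response to parse_item, the callback that is actually defined.
    rules = (
        Rule(
            LinkExtractor(allow=r'/', deny_extensions=[]),
            callback='parse_item',
            follow=True,
        ),
    )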
Here is an example I made that scrapes the first page of the site (which happens to be all of the "A"s), then parses the inner pages for the track information, and finally downloads the files to their respective folders. I just use a standard scrapy.Spider, and I include versions of the item class, the FilesPipeline, and all the custom settings in the same script:
import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.crawler import CrawlerProcess
import os


class MyPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # store each downloaded file under <artist>/<title>
        return os.path.join(item['artist'], item['title'])


class WebcrawlerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
    original_file_name = scrapy.Field()
    artist = scrapy.Field()
    title = scrapy.Field()


class WebcrawlSpider(scrapy.Spider):
    name = 'webcrawl'
    allowed_domains = ['www.midi-karaoke.info']
    start_urls = ['https://www.midi-karaoke.info']
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302, 301]

    def parse(self, response, **kwargs):
        for link in response.xpath("//div[@class='folders_and_files']/a"):
            text = link.xpath('./text()').get()
            if link.xpath('./@class').get() == 'f':
                # a track entry: remember the title and parse its page
                kw = {'title': text}
                callback = self.parse_item
            elif link.xpath('./text()').get() != '..':
                # a folder entry: remember the artist and recurse
                kw = {'artist': text}
                callback = self.parse
            else:
                continue
            url = response.urljoin(link.xpath('./@href').get())
            kw.update(kwargs)
            yield scrapy.Request(url, callback=callback, cb_kwargs=kw)

    def parse_item(self, response, artist="", title=""):
        midi = response.xpath('//table[@class="MIDI"]//table//a/@href').get()
        link = response.urljoin(midi)
        item = WebcrawlerItem()
        item["artist"] = artist
        item["title"] = title
        item['file_urls'] = [link]
        item['original_file_name'] = midi
        yield item
def main():
    process = CrawlerProcess(settings={
        "USER_AGENT": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        "ROBOTSTXT_OBEY": False,
        "ITEM_PIPELINES": {
            MyPipeline: 100,
        },
        "FILES_STORE": './folder',
        "FILES_URLS_FIELD": 'file_urls',
        "FILES_RESULT_FIELD": 'files',
    })
    process.crawl(WebcrawlSpider)
    process.start()


if __name__ == "__main__":
    main()
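If this is saved as a standalone script (for example as webcrawl.py; the filename is just an assumption), it can be run directly, since the CrawlerProcess above carries its own settings; alternatively, the spider class can be dropped into your existing project and launched the usual way:
python webcrawl.py
# or, from inside the Scrapy project (using the project's settings.py instead of main()):
scrapy crawl webcrawl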
Answered By - Alexander