Issue
I'm using Python 3.11 and Scrapy 2.7.1 on Windows 10. Following the Scrapy example that downloads files from nirsoft.net, I made some adjustments to crawl another website (https://www.midi-karaoke.info); please take a look.
I'm not sure, but I expect to fetch most of the HTML pages (more than 100,000) with my modified script, yet no .mid files.
The site itself behaves strangely. It has a very flat design with more than 100,000 numbered page names. If I browse down to a MIDI file link to download it, nothing happens. But if I inspect the source in the browser and click the .mid link there, I get the file; the same works if I change the page extension to .mid in the browser's address bar (https://www.midi-karaoke.info/21110cbd.html -> https://www.midi-karaoke.info/21110cbd.mid).
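Expressed in Python, that rename is just string manipulation; here is a minimal sketch with the URL from above, only to illustrate what I mean:
# derive the .mid download URL from the page URL, as described above
page_url = "https://www.midi-karaoke.info/21110cbd.html"
mid_url = page_url.rsplit(".html", 1)[0] + ".mid"
print(mid_url)  # -> https://www.midi-karaoke.info/21110cbd.mid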
Furthermore, changes made to my script sometimes work and sometimes don't. On the next pass, the next day, the same script may stop working again. Here is what I use:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from webcrawler.items import WebcrawlerItem  # import C:\..\scrapy\webcrawler\webcrawler\items.py


class WebcrawlSpider(CrawlSpider):
    name = 'webcrawl'
    allowed_domains = ['www.midi-karaoke.info']
    start_urls = ['https://www.midi-karaoke.info']
    # avoid redirects?
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302, 301]

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
        # the website lives only under '/'
        Rule(LinkExtractor(allow=(r'/'), deny_extensions=[], restrict_xpaths=('//a[@href]')), callback="parse_items", follow=True),
        # extract 'href' links
        Rule(LinkExtractor(allow=(r'/'), restrict_xpaths=('//a[@class="MIDI"]',)), callback="parse_items", follow=True),
        # the href links we care about are in class='MIDI'
    )

    def parse_item(self, response):
        file_url = response.css('.downloadline::attr(href)').getall()  # get all pages found
        file_url = response.urljoin(file_url)
        file_extension = file_url.split('.')[-1]
        # filter links by file extension (optional)
        if file_extension not in ('mid', 'html', 'zip'):
            return
        # if '.ru.' in file_url or '.en.' in file_url:
        #     return
        item = WebcrawlerItem()
        item['file_urls'] = [file_url]
        item['original_file_name'] = file_url.split('/')[-1]
        yield item
Sometimes this works and sometimes it does not. Please help.
settings.py:
# Scrapy settings for webcrawler project
BOT_NAME = 'webcrawler'
SPIDER_MODULES = ['webcrawler.spiders']
NEWSPIDER_MODULE = 'webcrawler.spiders'
DUPEFILTER_DEBUG = False
REDIRECT_ENABLED = False
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'webcrawler.pipelines.WebcrawlerPipeline': 1,
}
FILES_STORE = r"C:\Users\wiwa53\scrapy\webcrawler\downloads"
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
items.py:
# Define here the models for your scraped items
import scrapy


class WebcrawlerItem(scrapy.Item):
    file_urls = scrapy.Field()
    original_file_name = scrapy.Field()
    files = scrapy.Field
pipelines.py:
# Define your item pipelines here
from scrapy.pipelines.files import FilesPipeline


class WebcrawlerPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        file_name: str = request.url.split("/")[-1]
        # print(file_name)
        return file_name
Solution
There are a few noticeable issues in your code. For example, you have multiple rules that will match the same URLs, and two of them point to a non-existent callback method parse_items. Your WebcrawlerItem also declares files = scrapy.Field without the call parentheses, so the files field that the FilesPipeline writes its results to is never actually defined. I also don't see any attempt to extract the artist/title information you would need to organize the downloads.
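As an aside, if you did want to stay with CrawlSpider, the three overlapping rules could be collapsed into a single rule pointing at the callback that actually exists; this is only a minimal sketch, to be placed inside your spider class in place of the current rules (LinkExtractor and Rule are already imported in your file):
    # Sketch: one rule that follows every internal link and sends each
    # response to parse_item, the callback that is actually defined.
    rules = (
        Rule(
            LinkExtractor(allow=r'/', deny_extensions=[]),
            callback='parse_item',
            follow=True,
        ),
    )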
Here is an example I made that scrapes the first page of the site (which happens to be all of the "A"s), then parses the inner pages for the track information, and finally downloads the files to their respective folders. I just use a standard scrapy.Spider, and I include versions of the item class, the FilesPipeline, and all the custom settings in the same script:
import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.crawler import CrawlerProcess
import os


class MyPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # store each downloaded file under <artist>/<title>
        return os.path.join(item['artist'], item['title'])


class WebcrawlerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
    original_file_name = scrapy.Field()
    artist = scrapy.Field()
    title = scrapy.Field()


class WebcrawlSpider(scrapy.Spider):
    name = 'webcrawl'
    allowed_domains = ['www.midi-karaoke.info']
    start_urls = ['https://www.midi-karaoke.info']
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302, 301]

    def parse(self, response, **kwargs):
        for link in response.xpath("//div[@class='folders_and_files']/a"):
            text = link.xpath('./text()').get()
            if link.xpath('./@class').get() == 'f':
                # a track entry: remember the title and parse its page
                kw = {'title': text}
                callback = self.parse_item
            elif link.xpath('./text()').get() != '..':
                # a folder entry: remember the artist and recurse
                kw = {'artist': text}
                callback = self.parse
            else:
                continue
            url = response.urljoin(link.xpath('./@href').get())
            kw.update(kwargs)
            yield scrapy.Request(url, callback=callback, cb_kwargs=kw)

    def parse_item(self, response, artist="", title=""):
        midi = response.xpath('//table[@class="MIDI"]//table//a/@href').get()
        link = response.urljoin(midi)
        item = WebcrawlerItem()
        item["artist"] = artist
        item["title"] = title
        item['file_urls'] = [link]
        item['original_file_name'] = midi
        yield item
def main():
    process = CrawlerProcess(settings={
        "USER_AGENT": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        "ROBOTSTXT_OBEY": False,
        "ITEM_PIPELINES": {
            MyPipeline: 100,
        },
        "FILES_STORE": './folder',
        "FILES_URLS_FIELD": 'file_urls',
        "FILES_RESULT_FIELD": 'files',
    })
    process.crawl(WebcrawlSpider)
    process.start()


if __name__ == "__main__":
    main()
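If this is saved as a standalone script (for example as webcrawl.py; the filename is just an assumption), it can be run directly, since the CrawlerProcess above carries its own settings; alternatively, the spider class can be dropped into your existing project and launched the usual way:
python webcrawl.py
# or, from inside the Scrapy project (using the project's settings.py instead of main()):
scrapy crawl webcrawl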
Answered By - Alexander