Issue
I am new to Scrapy and have run into a tricky case.
Sometimes I get links like https://sitename.com/path2/?param1=value1&param2=value2
The query string is not important to me, and I want to drop it from my requests.
I mean this part of the URL:
?param1=value1&param2=value2
After a day of research, I realized that this should be done in the middlewares.py file (Downloader Middleware), because requests and responses in Scrapy pass through it.
I tried to write code so that requests and responses have no query string, but I did not succeed.
My code has no effect on requests that include a query string.
middlewares.py:
from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:
    def process_response(self, request, response, spider):
        url_query_cleaner(response.url)
        return response

    def process_request(self, request, spider):
        url_query_cleaner(request.url)
How can I clean up these requests using the w3lib.url library or plain Python, so that the query strings never enter Scrapy?
Note that I have already enabled my middleware class in settings.py.
Solution
Since strings are immutable, url_query_cleaner returns a new string instead of modifying request.url, and your code discards that return value, so nothing in the request changes. For your code to work, you have to do:
from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:
    # No need for process_response since the response will have
    # the same url as the request
    def process_request(self, request, spider):
        if "?" in request.url:
            # Returning a new Request makes Scrapy reschedule it
            # with the cleaned url in place of the original one
            return request.replace(url=url_query_cleaner(request.url))
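
The immutability point can be checked outside Scrapy. The following is a minimal sketch that uses only the standard library; strip_query is a hypothetical helper written for illustration (it is not part of w3lib or Scrapy) and only approximates what url_query_cleaner does with its default arguments. It shows that calling the function without using the return value leaves the original string untouched, which is why the original middleware had no effect.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the url from scheme, host and path only
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


url = "https://sitename.com/path2/?param1=value1&param2=value2"

# The return value is discarded here, so url itself is not changed
strip_query(url)

# The cleaned url only exists if the return value is kept
cleaned = strip_query(url)
```

In the middleware, request.replace(url=...) plays the role of keeping the return value: it builds a new request that carries the cleaned URL.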
Alternatively, if you want to ignore requests that have a query string in their URL, you can do:
from scrapy.exceptions import IgnoreRequest
from urllib.parse import urlparse

class IgnoreQueryRequestMiddleware:
    def process_request(self, request, spider):
        if urlparse(request.url).query:
            raise IgnoreRequest
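
To get a feel for which URLs that check would match, here is a small standard-library sketch. has_query is a hypothetical helper that only mirrors the condition used in the middleware above; it is not a Scrapy API. Note that a URL ending in a bare "?" has an empty query, so a check like this would let it through.

```python
from urllib.parse import urlparse


def has_query(url: str) -> bool:
    """Mirror the middleware condition: True if the url has a non-empty query."""
    return bool(urlparse(url).query)


with_query = has_query("https://sitename.com/path2/?param1=value1&param2=value2")
without_query = has_query("https://sitename.com/path2/")
# A trailing "?" with nothing after it yields an empty query string
bare_question_mark = has_query("https://sitename.com/path2/?")
```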
Answered By - zaki98