Issue
I'd like to scrape all the linked JavaScript and CSS files on a given domain with Scrapy. The issue is that I don't quite understand how to extract the file URLs from the link and script elements.
Assume I'm scraping example.com. There are links to JS and CSS files of the form:
<link rel="stylesheet" href="/path_to_css/example.css"/>
<script src="/path_to_js/example.js"></script>
These paths start from the root of the domain, so they're no problem. But if the links look like the ones below, it starts to get confusing:
<link rel="stylesheet" href="path_to_css/example.css"/>
<script src="path_to_js/example.js"></script>
These relative URLs are supposed to work such that if I'm on example.com/some_page/, the link paths are appended to it, like example.com/some_page/path_to_js/example.js. That's not how it always works on actual web pages, however. On some sites with language selection, e.g. example.com/en/some_page, the relative paths start from example.com/en instead of the full path of that page. So, while expecting to find the files at example.com/en/some_page/path_to_js/example.js, you find them at example.com/en/path_to_js/example.js.
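In fact, a quick check with Python's urllib.parse.urljoin reproduces both behaviors; it looks like the trailing slash on the page URL is what changes the result:

from urllib.parse import urljoin

# With a trailing slash, the last path segment is kept as part of the base
print(urljoin("https://example.com/some_page/", "path_to_js/example.js"))
# -> https://example.com/some_page/path_to_js/example.js

# Without a trailing slash, the last segment is dropped before joining
print(urljoin("https://example.com/en/some_page", "path_to_js/example.js"))
# -> https://example.com/en/path_to_js/example.js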
Is there any way to tell where the relative paths start from?
Solution
While scraping, Scrapy lets you create an absolute URL from a relative URL with response.urljoin(). It resolves the link against the URL of the page being parsed (honoring a <base> tag if the page declares one), following the standard resolution rules shown above: everything after the last slash of the page's path is dropped before the relative path is appended. You could do something like this:
for link in response.css("link"):
response.urljoin(link.css("::attr(href)").extract_first())
for script in response.css("script"):
response.urljoin(script.css("::attr(src)").extract_first())
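Putting it together, here is a minimal spider sketch. The spider name, start URL, and the yielded item shape are illustrative assumptions on top of the answer, not something it specifies:

import scrapy

class AssetSpider(scrapy.Spider):
    # Hypothetical name and start URL, just for illustration
    name = "assets"
    start_urls = ["https://example.com/en/some_page"]

    def parse(self, response):
        # Stylesheets: resolve each href against the page's base URL
        for href in response.css("link[rel=stylesheet]::attr(href)").extract():
            yield {"type": "css", "url": response.urljoin(href)}
        # External scripts: resolve each src the same way
        for src in response.css("script::attr(src)").extract():
            yield {"type": "js", "url": response.urljoin(src)}

From parse() you could also yield scrapy.Request(url) instead of an item if you want to download the files themselves.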
Answered By - Umair Ayub