Issue
I am trying to scrape an HTML page that uses this structure:
<div class="article-body">
<div id="firstBodyDiv">
<p class="ng-scope">
This is a dummy text for explanation purposes
</p>
<p class="ng-scope">
This is a <a>dummy</a> text for explanation purposes
</p>
</div>
</div>
As you can see, some of the p elements contain a elements and some don't. What I have done so far is the following:
economics["article_content"] = response.css("div.article-body div#firstBodyDiv > p:nth-child(n+1)::text").extract()
However, it returns only the text before and after the a element when there is an a element inside the p element, while this query returns the text of the a element(s):
response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a::text").extract()
I want to find a way to check whether a p element contains an a element, so that I can run the other query (the one that scrapes the text inside the a element). This is what I have tried so far:
for i in response.css("div.article-body div#firstBodyDiv p:nth-child(n+1)"):
    if response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a") in i:
        # Of course this isn't working; I am getting this error:
        # 'in <string>' requires string as left operand, not SelectorList
        # Probably I will need a separate list, list1, and list1.append() the p text
        # before the a, the a text, and the p text after the a element,
        # then assign that list to economics["article_content"]
Although I am using CSS selectors, answers that use XPath selectors are welcome too.
Solution
You can use the XPath descendant-or-self axis, which returns every text node inside each p element, including text nested in a elements.
for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))
You can also use scrapy shell to test your code against raw HTML, like so:
$ scrapy shell
from scrapy.http import HtmlResponse
response = HtmlResponse(url='test', body='''<div class="article-body">
<div id="firstBodyDiv">
<p class="ng-scope">
This is a dummy text for explanation purposes
</p>
<p class="ng-scope">
This is a <a>dummy</a> text for explanation purposes
</p>
</div>
</div>
''', encoding='utf-8')
for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))
Answered By - Sewake