Thursday, November 4, 2021

[FIXED] Scrapy: checking if the tag has another tag inside it and scrape both elements


Issue

I am trying to scrape an HTML page that uses this structure:

<div class="article-body">
    <div id="firstBodyDiv">
        <p class="ng-scope">
            This is a dummy text for explanation purposes
        </p>
        <p class="ng-scope">
          This is a <a>dummy</a> text for explanation purposes
        </p>
    </div>
</div>

As you can see, some of the p elements contain a elements and some don't. What I have done so far is the following:

economics["article_content"] = response.css("div.article-body div#firstBodyDiv > p:nth-child(n+1)::text").extract()

But when there is an a element inside the p element, this returns only the text before and after the a element,

while this query returns the text of the a element(s):

response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a::text").extract()

I want to find a way to check whether a p element contains an a element, so that I can run the other query (the one that scrapes the text inside the a element).

This is what I have tried so far:

for i in response.css("div.article-body div#firstBodyDiv p:nth-child(n+1)"):
    if response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a") in i:
        # Of course this isn't working; I am getting this error:
        # 'in <string>' requires string as left operand, not SelectorList
        # I will probably need a separate list, list1, and list1.append() the p text
        # before the a, the a text, and the p text after the a element,
        # then assign that list to economics["article_content"]
        pass

Although I am using CSS selectors, you are welcome to use XPath selectors.

Solution

You can use XPath's descendant-or-self axis, which selects every text node inside the element, including text nested in child elements such as a.

for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))

You can also use the Scrapy shell to test your code against raw HTML, like so:

$ scrapy shell
from scrapy.http import HtmlResponse
response = HtmlResponse(url='test', body='''<div class="article-body"> 
   <div id="firstBodyDiv"> 
       <p class="ng-scope"> 
           This is a dummy text for explanation purposes 
       </p> 
       <p class="ng-scope"> 
         This is a <a>dummy</a> text for explanation purposes 
       </p> 
   </div> 
</div> 
''', encoding='utf-8')
for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))

Answered By - Sewake

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
