Issue
I've been trying to extract one specific value from a page with Scrapy for a scraping project. My current Python code is:
bedrooms_info = house_listing.css(
    '.search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text').get()
bedrooms = self.extract_number(bedrooms_info) if bedrooms_info else None
The extract_number method used above is:
def extract_number(self, value):
    try:
        # Use a regular expression to extract the first run of digits
        match = re.search(r'\d+', value)
        return int(match.group()) if match else None
    except (TypeError, ValueError):
        return None
And the relevant HTML from the website in question is:
<div class="search-results-listings-list__item-description__item search-results-listings-list__item-description__characteristics">
<div class="search-results-listings-list__item-description__characteristics__item">
<!--?xml version="1.0"?-->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 46 41" class="search-results-listings-list__item-description__characteristics__icon search-results-listings-list__item-description__characteristics__icon--bedrooms"><path d="M5.106 0c-.997 0-1.52.904-1.52 1.533v11.965L.074 23.95c-.054.163-.074.38-.074.486V39.2c-.017.814.727 1.554 1.54 1.554.796 0 1.54-.74 1.52-1.554v-3.555h39.88V39.2c-.016.814.724 1.554 1.52 1.554.813 0 1.56-.74 1.54-1.554V24.436c0-.106-.017-.326-.074-.486l-3.512-10.449V1.537c0-.633-.523-1.534-1.52-1.534H5.106V0zm1.54 3.07h32.708v3.663a5.499 5.499 0 0 0-2.553-.614h-9.708c-1.614 0-3.06.687-4.093 1.77a5.648 5.648 0 0 0-4.093-1.77H9.2c-.924 0-1.793.217-2.553.614V3.07zm2.553 6.098h9.708c1.45 0 2.553 1.12 2.553 2.547v.523H6.646v-.523c0-1.426 1.103-2.547 2.553-2.547zm17.894 0H36.8c1.45 0 2.553 1.12 2.553 2.547v.523H24.54v-.523c0-1.426 1.103-2.547 2.553-2.547zm-20.88 6.12H39.79l2.553 7.615H3.656l2.556-7.615zM3.06 25.973h39.88v6.625H3.06v-6.625z"></path></svg>
<div class="search-results-listings-list__item-description__characteristics-popover">Chambres</div>
1
</div>
</div>
I've been trying for a whole day to extract the number of bedrooms (in the HTML above, it's the 1). However, all my program returns is null. If anyone has any insight into how I could extract that specific number, I'd appreciate it.
I've tried multiple different approaches, most of them ending with null. One alternative led me to extracting "Chambres" rather than the actual number of bedrooms. This alternative approach also returns null:
bedrooms_info = house_listing.css(
    'div.search-results-listings-list__item-description__characteristics__item::text').get()
Solution
You are very close. The only change you really need is to call getall() instead of get() on your CSS query:
.search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text
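In plain English, your CSS query says: find the element with class .search-results-listings-list__item-description__characteristics__item whose contents include the text "Chambres", then return the text nodes inside it. The space before ::text matters: it makes the query match every text node nested anywhere in that element, not just one.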
So your selector is correct. The only issue is that the query matches several text nodes, and get() returns only the first of them, which here is most likely just the whitespace before the icon, so extract_number finds no digits in it. getall() returns every match as a list, and the one you are looking for is the last item in that list.
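To make the difference concrete, here is a minimal sketch that needs only the standard library. The text_nodes list is a hand-written stand-in for what getall() would return for the HTML in the question; it is an assumption, and the exact whitespace on the real page may differ. extract_number is a standalone copy of the question's helper without self.

```python
import re


def extract_number(value):
    """Standalone copy of the question's helper: first run of digits, or None."""
    try:
        match = re.search(r'\d+', value)
        return int(match.group()) if match else None
    except (TypeError, ValueError):
        return None


# Assumed stand-in for the list returned by getall() on the question's HTML.
text_nodes = ['\n', '\n', '\n', 'Chambres', '\n1\n']

first_only = text_nodes[0]   # roughly what get() hands back: the first match
last_node = text_nodes[-1]   # the text node that follows the "Chambres" label

print(extract_number(first_only))  # no digits in the first node
print(extract_number(last_node))   # digits found in the last node
```

Under these assumptions the first call finds nothing to parse, while the last node yields the bedroom count.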
So an example of successfully extracting the "1" value would be:
html = """
<html>
<div class="search-results-listings-list__item-description__item search-results-listings-list__item-description__characteristics">
<div class="search-results-listings-list__item-description__characteristics__item">
<!--?xml version="1.0"?-->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 46 41" class="search-results-listings-list__item-description__characteristics__icon search-results-listings-list__item-description__characteristics__icon--bedrooms"><path d="M5.106 0c-.997 0-1.52.904-1.52 1.533v11.965L.074 23.95c-.054.163-.074.38-.074.486V39.2c-.017.814.727 1.554 1.54 1.554.796 0 1.54-.74 1.52-1.554v-3.555h39.88V39.2c-.016.814.724 1.554 1.52 1.554.813 0 1.56-.74 1.54-1.554V24.436c0-.106-.017-.326-.074-.486l-3.512-10.449V1.537c0-.633-.523-1.534-1.52-1.534H5.106V0zm1.54 3.07h32.708v3.663a5.499 5.499 0 0 0-2.553-.614h-9.708c-1.614 0-3.06.687-4.093 1.77a5.648 5.648 0 0 0-4.093-1.77H9.2c-.924 0-1.793.217-2.553.614V3.07zm2.553 6.098h9.708c1.45 0 2.553 1.12 2.553 2.547v.523H6.646v-.523c0-1.426 1.103-2.547 2.553-2.547zm17.894 0H36.8c1.45 0 2.553 1.12 2.553 2.547v.523H24.54v-.523c0-1.426 1.103-2.547 2.553-2.547zm-20.88 6.12H39.79l2.553 7.615H3.656l2.556-7.615zM3.06 25.973h39.88v6.625H3.06v-6.625z"></path></svg>
<div class="search-results-listings-list__item-description__characteristics-popover">Chambres</div>
1
</div>
</div>
</html>
"""
import scrapy
import re
selector = scrapy.Selector(text=html)
bedrooms_info = selector.css('.search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text').getall()
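bedrooms = bedrooms_info[-1]  # the last text node, e.g. '\n 1\n '
# Use re.search, not re.match: the string starts with whitespace and re.match only matches at the start
print(int(re.search(r'\d+', bedrooms).group()))  # 1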
OUTPUT
1
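If some listings might lack the trailing number, indexing with [-1] and calling .group() directly could raise an error or pick up the wrong node. One possible way to guard against that is a small helper that scans the list from the end and returns the first number it finds. The name last_number is hypothetical and not part of Scrapy; this is only a sketch of the idea.

```python
import re


def last_number(text_nodes):
    """Return the last integer found in a list of strings, or None if there is none."""
    for text in reversed(text_nodes):
        match = re.search(r'\d+', text)
        if match:
            return int(match.group())
    return None


# Hand-written sample lists standing in for getall() results (assumed values).
print(last_number(['\n', 'Chambres', '\n 1\n ']))
print(last_number(['\n', 'Chambres', '\n']))
print(last_number([]))
```

In the spider, this could be called as last_number(house_listing.css(...).getall()) with the same selector as above, so a listing without a bedroom count yields None instead of an exception.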
Answered By - Alexander