Issue
I am trying to scrape the number of rooms available for each room type on a hotel's booking.com page. The problem is that each room type has several select elements, each showing the number of rooms currently available, and I want only the first select element for each room type. The page URL appears in the request code below; here is the HTML source of one such element:
<td class=" hprt-table-cell hprt-table-room-select ">
<div class="hprt-block">
<label for="hprt_nos_select_7711208_105663500_0_1_0"><span class="invisible_spoken">Select rooms</span></label>
<select
class="hprt-nos-select js-hprt-nos-select"
name="nr_rooms_7711208_105663500_0_1_0"
data-component="hotel/new-rooms-table/select-rooms"
data-room-id="7711208"
data-block-id="7711208_105663500_0_1_0"
data-is-fflex-selected="0"
data-testid="select-room-trigger"
id="hprt_nos_select_7711208_105663500_0_1_0"
aria-describedby="room_type_id_7711208 rate_price_id_7711208_105663500_0_1_0 rate_policies_id_7711208_105663500_0_1_0"
>
<option value="0">
0
</option>
<option value="1">
1
(PKR 130,413)
</option>
<option value="2">
2
(PKR 260,825)
</option>
<option value="3">
3
(PKR 391,238)
</option>
<option value="4">
4
(PKR 521,650)
</option>
<option value="5">
5
(PKR 652,063)
</option>
<option value="6">
6
(PKR 782,475)
</option>
<option value="7">
7
(PKR 912,888)
</option>
<option value="8">
8
(PKR 1,043,300)
</option>
<option value="9">
9
(PKR 1,173,713)
</option>
<option value="10">
10
(PKR 1,304,125)
</option>
</select>
</div>
</td>
</tr>
I tried conventional parsing with BeautifulSoup, but I get a value for every dropdown:
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'www.booking.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en,ru;q=0.9',
'cache-control': 'max-age=0',
# 'cookie': '<session cookies omitted>',
'sec-ch-ua': '"Chromium";v="110", "Not A(Brand";v="24", "YaBrowser";v="23"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 YaBrowser/23.3.1.906 (beta) Yowser/2.5 Safari/537.36',
}
response = requests.get(
'https://www.booking.com/hotel/cz/benica.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaLUBiAEBmAEJuAEZyAEM2AEB6AEB-AEMiAIBqAIDuAKn9fqhBsACAdICJDk5NjkxYzcxLWM1ZGMtNDkyMC1hYTMzLWViODA3MmY2YzQ0NtgCBuACAQ&sid=17bbe3350ba8fd3998e8883df7bb3fdc&all_sr_blocks=7711208_105663500_0_1_0;checkin=2023-06-25;checkout=2023-06-30;dist=0;group_adults=2;group_children=0;hapos=1;highlighted_blocks=7711208_105663500_0_1_0;hpos=1;matching_block_id=7711208_105663500_0_1_0;no_rooms=1;req_adults=2;req_children=0;room1=A%2CA;sb_price_type=total;sr_order=distance_from_search;sr_pri_blocks=7711208_105663500_0_1_0__42000;srepoch=1681836060;srpvid=473c754d9f8a0016;type=total;ucfs=1&',
headers=headers,
)
soup = BeautifulSoup(response.text, 'html.parser')
available_rooms = []
empty_rooms = soup.select('select.hprt-nos-select.js-hprt-nos-select')
for e in empty_rooms:
    # Subtract 1 because the first option is the "0 rooms" choice.
    available_rooms.append(len(e.select('option')) - 1)
The code above outputs:
[10, 10, 10, 4, 4, 1, 1, 1, 1, 2, 2]
And my expected output is:
[10, 4, 1, 1, 2]
Solution
From what I can see, each dropdown sits in its own table row, and each room type spans multiple rows. On the last row of a room type, the class on the tr element is different.
Apologies for the rough formatting; this is copy-pasted from the console:
<tr data-block-id="7711201_352735385_0_1_0" data-hotel-rounded-price="280" class="js-rt-block-row e2e-hprt-table-row hprt-table-last-row ">
versus a row that is not the last one for its room type:
<tr data-block-id="7711201_105663500_0_1_0" data-hotel-rounded-price="250" class="js-rt-block-row e2e-hprt-table-row ">
As you can see, the class differs: the last row carries the extra class hprt-table-last-row. If we assume that every row of a room type offers the same number of rooms, you can do exactly what you are doing now but select only the last row of each room type:
# Match only the last row of each room type.
selector = "#hprt-table > tbody tr.hprt-table-last-row td div.hprt-block select"
soup = BeautifulSoup(response.text, 'html.parser')
available_rooms = []
empty_rooms = soup.select(selector)
for e in empty_rooms:
    # Subtract 1 because the first option is the "0 rooms" choice.
    available_rooms.append(len(e.select('option')) - 1)
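If the same-count assumption feels fragile, another option is to key on the data-room-id attribute that each select carries in the question's markup and keep only the first dropdown seen per room type. This is a minimal sketch under assumptions, not something I have run against the live page: SAMPLE_HTML is trimmed, made-up markup with invented option counts, and rooms_per_type is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the question's structure (assumption):
# two rate rows for room 7711208 and one row for room 7711201.
SAMPLE_HTML = """
<table id="hprt-table"><tbody>
<tr><td><select class="hprt-nos-select" data-room-id="7711208">
<option value="0">0</option><option value="1">1</option><option value="2">2</option>
</select></td></tr>
<tr class="hprt-table-last-row"><td><select class="hprt-nos-select" data-room-id="7711208">
<option value="0">0</option><option value="1">1</option><option value="2">2</option>
</select></td></tr>
<tr class="hprt-table-last-row"><td><select class="hprt-nos-select" data-room-id="7711201">
<option value="0">0</option><option value="1">1</option>
</select></td></tr>
</tbody></table>
"""


def rooms_per_type(html):
    """Map each data-room-id to the room count of its first dropdown."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {}
    for select in soup.select("select.hprt-nos-select[data-room-id]"):
        room_id = select["data-room-id"]
        # setdefault keeps the value from the first dropdown of each room
        # type; the "- 1" skips the leading "0 rooms" option.
        counts.setdefault(room_id, len(select.select("option")) - 1)
    return counts


room_counts = rooms_per_type(SAMPLE_HTML)
print(list(room_counts.values()))  # [2, 1]
```

Because dictionaries preserve insertion order, the values come out in the order the room types appear on the page, which matches the list format the question expects.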
Answered By - mrblue6