Monday, December 25, 2023

[FIXED] html scraping returns empty text and value fields

Labels: beautifulsoup, html, python, web-scraping

Issue

I am trying to scrape the available rooms from this booking website. My Attempt:

import requests
from bs4 import BeautifulSoup

#FORMAT: TT_tt_MMM_jjjj
#At max 2 weeks in advance
date = "Sa 23 Sep 2023"
min_space = 3
only_big_rooms = False
hut_id = 92


URL = "https://www.alpsonline.org/reservation/calendar?hut_id=" + str(hut_id)


response = requests.get(URL)
#print(response.text)
soup = BeautifulSoup(response.text, 'html.parser')

for i in range(0,13):
    bookingID = "bookingDate" + str(i)
    block = soup.find('div', {'id':bookingID})
    if (block.text == date):
        print("found it once!")
    print(block.text)
    print(block)

#also failed: 
for i in range(0,13):
    bookingID = "bookingDateHidden" + str(i)
    block = soup.find('input', {'id':bookingID})
    if (block.text == date):
        print("found it once!")
    print(block.value)
    print(block)

When running the code, all the labels and texts the code gets from the site are empty. So my console output looks like this:

moritz@cupid:~$ python3 Alpines_scraper.py 
<div class="main-label" id="bookingDate0"></div>
<div class="main-label" id="bookingDate1"></div>
<div class="main-label" id="bookingDate2"></div>
<div class="main-label" id="bookingDate3"></div>
<div class="main-label" id="bookingDate4"></div>
<div class="main-label" id="bookingDate5"></div>
<div class="main-label" id="bookingDate6"></div>
<div class="main-label" id="bookingDate7"></div>
<div class="main-label" id="bookingDate8"></div>
<div class="main-label" id="bookingDate9"></div>
<div class="main-label" id="bookingDate10"></div>
<div class="main-label" id="bookingDate11"></div>
<div class="main-label" id="bookingDate12"></div>
None
<input id="bookingDateHidden0" type="hidden"/>
None
<input id="bookingDateHidden1" type="hidden"/>
None
<input id="bookingDateHidden2" type="hidden"/>
...

After a bit of research, I found out that the site probably hadn't loaded the content at the time of the request. I then found this SO post, but couldn't really wrap my head around it, since this is my first try at web scraping.

If someone is willing to help me, or knows of a tutorial that deals with this kind of problem, I'd be happy to hear from you.

Solution

Generally, when a page loads its content dynamically with JavaScript, you can use a browser automation tool such as Selenium. Sometimes, though, you can work out where the data is coming from by looking at the Network tab in the browser's developer tools. Once you've found the underlying request, you can make it yourself, which is faster than using Selenium.

The website you're trying to scrape calls $.getJSON('/reservation/selectDate?date=' + dateText, ... to load the data. In the Network tab, you can see a request for /reservation/selectDate?date=12.09.2023.
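The date parameter in that request is in dd.mm.yyyy form, whereas the labels in the question look like "Sa 23 Sep 2023". Here is a minimal sketch of converting one into the other. It assumes the label is always "weekday day month year" with an English month abbreviation, as in the question's example. to_query_date is a hypothetical helper name, not something provided by the site or by any library.

```python
from datetime import datetime


def to_query_date(label: str) -> str:
    """Convert a label such as 'Sa 23 Sep 2023' to '23.09.2023'."""
    # Drop the weekday token; it may be localised (e.g. 'Sa') and is
    # not needed to identify the date.
    _, day_month_year = label.split(" ", 1)
    # %b parses the abbreviated month name in the current locale
    # (assumed to be English here).
    parsed = datetime.strptime(day_month_year, "%d %b %Y")
    return parsed.strftime("%d.%m.%Y")


print(to_query_date("Sa 23 Sep 2023"))  # 23.09.2023
```

If the site renders localised month names, the parsing step would need a lookup table instead of %b.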

If you visit that URL without first visiting /reservation/calendar, you get an error. If you try to scrape it directly, it returns a page about cookies not being enabled. Looking at the cookies the page has set (Application > Cookies in developer tools), you can see JSESSIONID and SRVGROUP cookies, and the first of these seems to affect the result.

To get a JSESSIONID, you can make a HEAD request to /reservation/calendar?hut_id=92 and then request /reservation/selectDate with that session id. A HEAD request is used because it gets the cookie without having to download the whole page. The second request then returns JSON, which you can parse.

This can be done as follows:

import sys
from datetime import datetime

import requests

# Maps bedCategoryId values from the JSON to readable names
ROOM_MAP = {4: "large rooms", 5: "shared bedrooms"}

today = datetime.now().strftime("%d.%m.%Y")

# A HEAD request is enough to be handed a session cookie
r = requests.head("https://www.alpsonline.org/reservation/calendar?hut_id=92")
session_id = r.cookies.get("JSESSIONID")

if not session_id:
    print("Couldn't get JSESSIONID")
    sys.exit(1)

cookies = {
    "JSESSIONID": session_id
}

r = requests.get("https://www.alpsonline.org/reservation/selectDate",
                 params={"date": today}, cookies=cookies)

# Check we have received JSON
if r.headers.get("Content-Type", "").startswith("application/json"):
    for date in r.json().values():
        print(date[0]["reservationDate"])
        for room_type in date:
            # Fall back to the raw id if the room type is not in ROOM_MAP
            category_id = room_type["bedCategoryId"]
            print(room_type["freeRoom"], ROOM_MAP.get(category_id, category_id))
else:
    print("Scrape failed")
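The same two-step flow can also be written against a session object, so the cookie does not have to be copied by hand. The sketch below assumes `session` behaves like requests.Session, which stores cookies from one response and sends them with later requests. fetch_availability is a hypothetical helper name, and the URLs are the ones used above.

```python
def fetch_availability(session, hut_id, date_text):
    """Return the parsed JSON for date_text, or None if no JSON came back.

    `session` is expected to behave like requests.Session: cookies set by
    one response are sent automatically with later requests.
    """
    base = "https://www.alpsonline.org/reservation"
    # Visit the calendar first so the server can set JSESSIONID on the session.
    session.head(f"{base}/calendar", params={"hut_id": hut_id})
    r = session.get(f"{base}/selectDate", params={"date": date_text})
    # Without a valid session the site is described as returning an HTML
    # page about cookies, so anything that is not JSON counts as a failure.
    if not r.headers.get("Content-Type", "").startswith("application/json"):
        return None
    return r.json()
```

With requests installed, it could be called as fetch_availability(requests.Session(), 92, "12.09.2023"). Passing the session in as an argument also makes the function easy to exercise with a stand-in object, without touching the network.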

Edit - Dynamically generating ROOM_MAP

You can generate ROOM_MAP for each page by using Beautiful Soup to parse it. You can install it with pip install beautifulsoup4.

Looking at the page source, the room type names are already there. For hut_id=297, for example, #bedCategoryLabel0-7 is Large rooms. The first number is the key from the JSON (which we can ignore) and the second is the bedCategoryId. The CSS selector [id^=bedCategoryLabel0-] matches all elements whose id begins with bedCategoryLabel0-, which gives one element per room type. The mapping can then be generated from the number at the end of the id and the text in the element. To add this to the original code, first add from bs4 import BeautifulSoup, then remove the hardcoded ROOM_MAP, and then change the first request to

today = datetime.now().strftime("%d.%m.%Y")
# BEGIN CHANGES
hut_id = 92  # the hut to query, as in the question
r = requests.get("https://www.alpsonline.org/reservation/calendar",
                 params={"hut_id": hut_id})
soup = BeautifulSoup(r.content, "html.parser")
# Key: number after the "-" in the id (bedCategoryId); value: label text
ROOM_MAP = {
    int(e["id"].split("-")[1]): e.get_text(strip=True)
    for e in soup.select("[id^=bedCategoryLabel0-]")
}
# END CHANGES
session_id = r.cookies.get("JSESSIONID")

Notice that the HEAD request has been replaced with a GET because we now need the content. The content is parsed with Beautiful Soup, which then finds the relevant elements using .select. Each id is split to get the number at the end (the bedCategoryId), which is mapped to the text of the element with surrounding whitespace removed. ROOM_MAP can then be used as before.
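With ROOM_MAP in place, the check the question was after (a given date with at least min_space free places) can be sketched as a small filter over the JSON. This assumes freeRoom counts free places and that reservationDate is a dd.mm.yyyy string. Neither is confirmed above. rooms_with_space is a hypothetical helper, and the sample payload is invented for illustration, using only the field names that appear in the answer's code.

```python
def rooms_with_space(payload, room_map, wanted_date, min_space):
    """Return (room name, free places) pairs for wanted_date with enough space."""
    matches = []
    for day in payload.values():
        for entry in day:
            if entry["reservationDate"] != wanted_date:
                continue
            if entry["freeRoom"] >= min_space:
                name = room_map.get(entry["bedCategoryId"], "unknown room type")
                matches.append((name, entry["freeRoom"]))
    return matches


# Hypothetical payload shaped like the response handled above.
sample = {
    "0": [
        {"reservationDate": "23.09.2023", "freeRoom": 2, "bedCategoryId": 4},
        {"reservationDate": "23.09.2023", "freeRoom": 5, "bedCategoryId": 5},
    ],
    "1": [
        {"reservationDate": "24.09.2023", "freeRoom": 8, "bedCategoryId": 4},
    ],
}
room_map = {4: "large rooms", 5: "shared bedrooms"}
print(rooms_with_space(sample, room_map, "23.09.2023", 3))
# [('shared bedrooms', 5)]
```

In the real script, payload would be r.json() and room_map the generated ROOM_MAP.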

Answered By - Henry

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
