Issue
Brief description: I am trying to retrieve all listing URLs on H&M's dresses page. I have been able to scrape the first 500 listing URLs, but not the remaining ~500.
What I'm trying to do: Retrieve all listing URLs on H&M's dresses page.
What I've done: I've implemented the code below, which retrieves the URLs of the first 500 product listings (out of ~1000).
The problem: I think there might be a limit on how many listings I can request at once. The obvious answer is to split this into two requests, one for the first 500 listings and one for the next 500, but I'm not sure how to do this. Currently, I set the "page-size" parameter in "params" to the actual number of listings on the page (~1000), which should return listings 0 to ~1000 but returns only 0-500. As far as I can tell, the only way to get listings 500-1000 would be to give the page-size parameter a range, which I can't do. Any ideas?
**Code:**
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
base_url = "https://www2.hm.com/en_us/women/products/dresses.html"
response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
pagination_element = soup.find('div', class_='filter-pagination')
items_text = pagination_element.get_text(strip=True)
num_listings = int(items_text.split()[0])
params = {
    "sort": "stock",
    "image-size": "small",
    "image": "model",
    "offset": 0,
    "page-size": num_listings
}
response = requests.get(base_url, params=params, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
listing_urls = [a["href"] for a in soup.select(".item-link")]
print(f'Listings collected: {len(listing_urls)}')
print(f"Total number of listings: {num_listings}")
print(f'Listing URLs 1-5: {listing_urls[:5]}')
Solution
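You can try to use their Ajax pagination API, which returns the listings as JSON, one page at a time, selected by the "offset" and "page-size" parameters: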
import pandas as pd
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
url = "https://www2.hm.com/en_us/women/products/dresses/_jcr_content/main/productlisting.display.json"
params = {
    "sort": "stock",
    "image-size": "small",
    "image": "model",
    "offset": "0",
    "page-size": "36",
}
all_products = []
for offset in range(0, 36 * 3, 36):  # <-- increase number of pages here
    print(f"{offset=}")
    params["offset"] = offset  # request the page that starts at this offset
    data = requests.get(url, headers=headers, params=params).json()
    all_products.extend(data["products"])
df = pd.DataFrame(all_products)
print(df.head())
Prints:
offset=0
offset=36
offset=72
articleCode onClick link title category image legalText promotionalMarkerText showPromotionalClubMarker showPriceMarker favouritesTracking favouritesSavedText favouritesNotSavedText marketingMarkerText marketingMarkerType marketingMarkerCss price redPrice yellowPrice bluePrice clubPriceText sellingAttribute swatchesTotal swatches preAccessStartDate preAccessEndDate preAccessGroups outOfStockText comingSoon brandName damStyleWith
0 0979874001 setOsaParameters(utag_data.category_id,'SMALL','0979874001'); setNotificationTicket('Oy9wbHAvcHJvZHVjdC1saXN0LXdpdGgtY291bnQvcHJvZHVjdC1saXN0OyM7cHJvZHVjdF9rZXk7MDk3OTg3NF9ncm91cF8wMDFfZW5fdXM7MDk3OTg3NDAwMV9lbl91cztPQkpFQ1RJVkUkO05PTkU6Tk9ORTs5OTs','0979874001'); /en_us/productpage.0979874001.html Sweatshirt Dress ladies_basics_dressesskirts [{'src': '//lp2.hm.com/hmgoepprod?set=source[/84/74/8474cbb3656a7a00c234f451bfbc8a4cc62db7c5.jpg],origin[dam],category[],type[LOOKBOOK],res[m],hmver[1]&call=url[file:/product/style]', 'dataAltImage': '//lp2.hm.com/hmgoepprod?set=source[/5f/55/5f55b8a40c9fe1c49374a6b10a2478a515febf62.jpg],origin[dam],category[ladies_basics_dressesskirts],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'alt': 'Sweatshirt Dress Model', 'dataAltText': 'Sweatshirt Dress'}] False False Favourites|0979874001|Sweatshirt Dress|LADIES_SHOPBYPRODUCT : DRESSES_DRESSES : VIEWALL_VIEW_ALL SAVED AS FAVORITE SAVE AS FAVORITE $ 24.99 5 [{'colorCode': '#272628', 'articleLink': '/en_us/productpage.0979874001.html', 'colorName': 'Black'}, {'colorCode': '#515359', 'articleLink': '/en_us/productpage.0979874011.html', 'colorName': 'Dark gray'}, {'colorCode': '#86917A', 'articleLink': '/en_us/productpage.0979874013.html', 'colorName': 'Khaki green'}, {'colorCode': '#86917A', 'articleLink': '/en_us/productpage.0979874015.html', 'colorName': 'Khaki green'}] [] H&M
1 1109917007 setOsaParameters(utag_data.category_id,'SMALL','1109917007'); setNotificationTicket('Oy9wbHAvcHJvZHVjdC1saXN0LXdpdGgtY291bnQvcHJvZHVjdC1saXN0OyM7cHJvZHVjdF9rZXk7MTEwOTkxN19ncm91cF8wMDdfZW5fdXM7MTEwOTkxNzAwN19lbl91cztPQkpFQ1RJVkUkO05PTkU6Tk9ORTs5OTs','1109917007'); /en_us/productpage.1109917007.html Rib-knit Dress ladies_dresses_longsleevedress [{'src': '//lp2.hm.com/hmgoepprod?set=source[/ad/79/ad79561ecb2ab9fdb3c77cd0aff996628b63bd6a.jpg],origin[dam],category[],type[LOOKBOOK],res[m],hmver[1]&call=url[file:/product/style]', 'dataAltImage': '//lp2.hm.com/hmgoepprod?set=source[/de/e9/dee94a36a5898f9008680f79070bf271bf6d5790.jpg],origin[dam],category[ladies_dresses_longsleevedress],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'alt': 'Rib-knit Dress Model', 'dataAltText': 'Rib-knit Dress'}] False False Favourites|1109917007|Rib-knit Dress|LADIES_SHOPBYPRODUCT : DRESSES_DRESSES : VIEWALL_VIEW_ALL SAVED AS FAVORITE SAVE AS FAVORITE $ 39.99 3 [{'colorCode': '#EEEDE1', 'articleLink': '/en_us/productpage.1109917007.html', 'colorName': 'Light beige'}, {'colorCode': '#272628', 'articleLink': '/en_us/productpage.1109917001.html', 'colorName': 'Black'}, {'colorCode': '#4B262D', 'articleLink': '/en_us/productpage.1109917006.html', 'colorName': 'Burgundy'}] [] H&M
2 0979874020 setOsaParameters(utag_data.category_id,'SMALL','0979874020'); setNotificationTicket('Oy9wbHAvcHJvZHVjdC1saXN0LXdpdGgtY291bnQvcHJvZHVjdC1saXN0OyM7cHJvZHVjdF9rZXk7MDk3OTg3NF9ncm91cF8wMjBfZW5fdXM7MDk3OTg3NDAyMF9lbl91cztPQkpFQ1RJVkUkO05PTkU6Tk9ORTs5OTs','0979874020'); /en_us/productpage.0979874020.html Sweatshirt Dress ladies_basics_dressesskirts [{'src': '//lp2.hm.com/hmgoepprod?set=source[/0c/68/0c686143cd3c02a69f8f19f64ab26200983ae51b.jpg],origin[dam],category[ladies_basics_dressesskirts],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'dataAltImage': '//lp2.hm.com/hmgoepprod?set=source[/0c/68/0c686143cd3c02a69f8f19f64ab26200983ae51b.jpg],origin[dam],category[ladies_basics_dressesskirts],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'alt': 'Sweatshirt Dress Model', 'dataAltText': 'Sweatshirt Dress'}] False False Favourites|0979874020|Sweatshirt Dress|LADIES_SHOPBYPRODUCT : DRESSES_DRESSES : VIEWALL_VIEW_ALL SAVED AS FAVORITE SAVE AS FAVORITE $ 24.99 New Arrival 5 [{'colorCode': '#DFD8C9', 'articleLink': '/en_us/productpage.0979874020.html', 'colorName': 'Light beige'}, {'colorCode': '#272628', 'articleLink': '/en_us/productpage.0979874001.html', 'colorName': 'Black'}, {'colorCode': '#515359', 'articleLink': '/en_us/productpage.0979874011.html', 'colorName': 'Dark gray'}, {'colorCode': '#86917A', 'articleLink': '/en_us/productpage.0979874013.html', 'colorName': 'Khaki green'}] [] H&M
3 1189024001 setOsaParameters(utag_data.category_id,'SMALL','1189024001'); setNotificationTicket('Oy9wbHAvcHJvZHVjdC1saXN0LXdpdGgtY291bnQvcHJvZHVjdC1saXN0OyM7cHJvZHVjdF9rZXk7MTE4OTAyNF9ncm91cF8wMDFfZW5fdXM7MTE4OTAyNDAwMV9lbl91cztPQkpFQ1RJVkUkO05PTkU6Tk9ORTs5OTs','1189024001'); /en_us/productpage.1189024001.html Balloon-sleeved Satin Dress ladies_dresses_party [{'src': '//lp2.hm.com/hmgoepprod?set=source[/38/71/3871b10536aaa49a40d5254713a7cff4f8f12a96.jpg],origin[dam],category[],type[LOOKBOOK],res[m],hmver[1]&call=url[file:/product/style]', 'dataAltImage': '//lp2.hm.com/hmgoepprod?set=source[/06/2f/062f95cfc20fd9481cdd0bea81f84d1cd18cb9e8.jpg],origin[dam],category[],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'alt': 'Balloon-sleeved Satin Dress Model', 'dataAltText': 'Balloon-sleeved Satin Dress'}] False False Favourites|1189024001|Balloon-sleeved Satin Dress|LADIES_SHOPBYPRODUCT : DRESSES_DRESSES : VIEWALL_VIEW_ALL SAVED AS FAVORITE SAVE AS FAVORITE $ 37.99 1 [{'colorCode': '#61806E', 'articleLink': '/en_us/productpage.1189024001.html', 'colorName': 'Green'}] [] H&M
4 1193445001 setOsaParameters(utag_data.category_id,'SMALL','1193445001'); setNotificationTicket('Oy9wbHAvcHJvZHVjdC1saXN0LXdpdGgtY291bnQvcHJvZHVjdC1saXN0OyM7cHJvZHVjdF9rZXk7MTE5MzQ0NV9ncm91cF8wMDFfZW5fdXM7MTE5MzQ0NTAwMV9lbl91cztPQkpFQ1RJVkUkO05PTkU6Tk9ORTs5OTs','1193445001'); /en_us/productpage.1193445001.html Twist-detail Satin Dress ladies_dresses_mididresses [{'src': '//lp2.hm.com/hmgoepprod?set=source[/ab/17/ab17aad0e7bc8e0b14d4272b06f06de9406c6baf.jpg],origin[dam],category[],type[LOOKBOOK],res[m],hmver[1]&call=url[file:/product/style]', 'dataAltImage': '//lp2.hm.com/hmgoepprod?set=source[/02/0c/020ce4ced26164e134c5b6aa914c6ba8891e6476.jpg],origin[dam],category[],type[DESCRIPTIVESTILLLIFE],res[m],hmver[2]&call=url[file:/product/style]', 'alt': 'Twist-detail Satin Dress Model', 'dataAltText': 'Twist-detail Satin Dress'}] False False Favourites|1193445001|Twist-detail Satin Dress|LADIES_SHOPBYPRODUCT : DRESSES_DRESSES : VIEWALL_VIEW_ALL SAVED AS FAVORITE SAVE AS FAVORITE $ 64.99 1 [{'colorCode': '#DFD8C9', 'articleLink': '/en_us/productpage.1193445001.html', 'colorName': 'Cream/black striped'}] [] H&M
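To get from this output to the full list of listing URLs the question asks for, one option is to keep requesting pages until one comes back empty, then read the "link" field of each product. The sketch below is an illustration only and is not part of the original answer. The names `collect_products`, `listing_urls`, `fetch_hm_page`, `SITE_ROOT`, `API_URL` and `HEADERS` are hypothetical. It also assumes, without having verified it, that the endpoint returns an empty "products" list once the offset passes the last listing, and that the relative links resolve against `https://www2.hm.com`.

```python
from typing import Callable, Iterable
from urllib.parse import urljoin

# Assumption: the relative "link" values in the JSON resolve against this host.
SITE_ROOT = "https://www2.hm.com"
# Endpoint and User-Agent copied from the answer above.
API_URL = "https://www2.hm.com/en_us/women/products/dresses/_jcr_content/main/productlisting.display.json"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
}


def collect_products(fetch_page: Callable[[int, int], list],
                     page_size: int = 36,
                     max_pages: int = 100) -> list:
    """Call fetch_page(offset, page_size) at increasing offsets until a page is empty."""
    products = []
    for page in range(max_pages):  # max_pages guards against looping forever
        batch = fetch_page(page * page_size, page_size)
        if not batch:  # assumption: an empty page means there are no more listings
            break
        products.extend(batch)
    return products


def listing_urls(products: Iterable[dict], root: str = SITE_ROOT) -> list:
    """Build absolute URLs from each product's "link", keeping first-seen order."""
    unique = {}
    for product in products:
        link = product.get("link")
        if link:
            unique.setdefault(urljoin(root, link), None)
    return list(unique)


def fetch_hm_page(offset: int, page_size: int) -> list:
    """Fetch one page of products from the JSON endpoint (performs a network request)."""
    import requests  # imported here so the helpers above can be used without it

    params = {
        "sort": "stock",
        "image-size": "small",
        "image": "model",
        "offset": offset,
        "page-size": page_size,
    }
    response = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("products", [])
```

Calling `listing_urls(collect_products(fetch_hm_page))` would issue the real requests. Passing the fetch function in as an argument keeps the paging logic separate from the network call, so it can be tried against canned data first.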
Answered By - Andrej Kesely