Wednesday, September 14, 2022

[FIXED] Why does my web scraping function not export the data?

Labels: beautifulsoup, pandas

Issue

I am currently scraping a few pages from a list of URLs. My code is below.

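import logging
import re
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup

pages = {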
"https://shop.supervalu.ie/shopping/wine-beer-spirits-germany/c-150410100",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-small-bottles/c-150410110",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-lager/c-150302375", #More than one page
"https://shop.supervalu.ie/shopping/wine-beer-spirits-stout/c-150302380",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-ale/c-150302385",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-lager/c-150302386",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-stout/c-150302387",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-ale/c-150302388", #More than one page
"https://shop.supervalu.ie/shopping/wine-beer-spirits-cider/c-150302389",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-cider/c-150302390",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-alcopops/c-150302395",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-vodka/c-150302430",
"https://shop.supervalu.ie/shopping/wine-beer-spirits-irish-whiskey/c-150302435", #More than one page
}


products = []
prices = []
images = []
urls = []


def export_data():
    logging.info("exporting data to pandas dataframe")

    supervalu = pd.DataFrame({
        'img_url' : images,
        'url' : urls,
        'product' : products,
        'price' : prices
    })

    logging.info("sorting data by price")

    supervalu.sort_values(by=['price'], inplace=True)

    output_json = 'supervalu.json'
    output_csv = 'supervalu.csv'
    output_dir = Path('../../json/supervalu')

    output_dir.mkdir(parents=True, exist_ok=True)

    logging.info("exporting data to json")

    supervalu.to_json(output_dir / output_json)

    logging.info("exporting data to csv")

    supervalu.to_csv(output_dir / output_csv)


def get_data(div):
    raw_data = div.find_all('div', class_='ga-product')
    raw_images = div.find_all('img')
    raw_url = div.find_all('a', class_="ga-product-link")

    product_data = [data['data-product'] for data in raw_data]

    new_data = [d.replace("\r\n","") for d in product_data]

    for name in new_data:
        new_names = re.search(' "name": "(.+?)"', name).group(1)
        products.append(new_names)

    for price in new_data:
        new_prices = re.search(' "price": "(.+?)"', price).group(1)
        prices.append(new_prices)

    for image in raw_images:
        new_images = image['data-src']
        images.append(new_images)

    for url in raw_url:
        new_url = url['href']
        urls.append(new_url)


def scrape_page(next_url):
    page = requests.get(next_url)

    if page.status_code != 200:
        logging.error("Page does not exist!")
        exit()

    soup = BeautifulSoup(page.content, 'html.parser') 

    get_data(soup.find(class_="row product-list ga-impression-group"))

    try:
        load_more_text = soup.find('a', class_='pill ajax-link load-more').findAll('span')[-1].text
            
        if load_more_text == 'Load more':
            next_page = soup.find('a', class_="pill ajax-link load-more").get('href')
            logging.info("Scraping next page: {}".format(next_page))
            scrape_page(next_page)
        else:
            export_data()
    except:
        logging.warning("No more next pages to scrape")
        pass

for page in pages:
    logging.info("Scraping page: {}".format(page))
    scrape_page(page)

The main issue arises in the try/except block that handles the next page. Not all of the pages have a "Load more" link, and on those pages an AttributeError is raised, which is why I wrapped that code in a try/except. I want to scrape the pages that have no next page as well, and keep looping through the rest of the list, following a next page whenever one exists. All of the pages appear to be looped through, but the data never gets exported. If I try the following code:

try:
    load_more_text = soup.find('a', class_='pill ajax-link load-more').findAll('span')[-1].text
        
    if load_more_text == 'Load more':
        next_page = soup.find('a', class_="pill ajax-link load-more").get('href')
        logging.info("Scraping next page: {}".format(next_page))
        scrape_page(next_page)
except:
    logging.warning("No more next pages to scrape")
    pass
else:
    export_data()

This is the closest I have gotten to the desired outcome. The above code works and the data gets exported, but not all of the pages end up in the export. It looks as if a new dataframe is created every time a run of next pages ends, i.e. the code iterates through the list, finds a next page, scrapes the next pages, then creates a new dataframe that replaces the previous data.

I'm hoping that someone can give me some guidance, as I have been stuck on this part of my personal project and I'm not sure how to overcome this obstacle. Thank you in advance.
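The behaviour described in the question seems to follow from how try/except/else interacts with BeautifulSoup's find(). The sketch below is not the asker's code: it uses two made-up HTML fragments and a hypothetical helper, branch_taken, to show which branch runs when the load-more link is present and when it is missing. It uses find_all, the current spelling of the findAll alias from the question.

```python
from bs4 import BeautifulSoup

# Two made-up page fragments (assumptions, not real SuperValu markup):
# one with a load-more link, one without.
WITH_LINK = '<a class="pill ajax-link load-more" href="/page-2"><span>Load more</span></a>'
WITHOUT_LINK = '<div class="row product-list"></div>'


def branch_taken(html):
    """Return which branch of a try/except/else runs for the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # find() returns None when nothing matches, so the chained
        # .find_all() call raises AttributeError on pages without the link.
        soup.find("a", class_="pill ajax-link load-more").find_all("span")[-1].text
    except AttributeError:
        return "except"
    else:
        return "else"


print(branch_taken(WITH_LINK))     # else
print(branch_taken(WITHOUT_LINK))  # except
```

In the first version of scrape_page, export_data() sits inside the try block, so the exception raised on a page without the link skips it. In the second version, the else clause of the try statement runs only when no exception was raised, which means export_data() is reached only for pages that do have a load-more link.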

Solution

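I modified my code as shown below and got the desired outcome. soup.find returns None when the page has no load-more link, so checking the result directly replaces the try/except and lets the else branch call export_data().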

load_more_link = soup.find('a', class_='pill ajax-link load-more')

if load_more_link:
    next_page = load_more_link.get('href')
    logging.info("Scraping next page: {}".format(next_page))
    scrape_page(next_page)
else:
    export_data()
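With this change, export_data() runs at the end of every category and rewrites the output files from the module-level lists, which keep growing as more pages are scraped. One possible restructuring, shown here only as a sketch, is to follow the pagination in a loop and export once after every start URL has been processed. The names scrape_category, scrape_all, fetch_page and FAKE_SITE are hypothetical, and fetch_page stands in for the requests and BeautifulSoup work done in scrape_page.

```python
def scrape_category(start_url, fetch_page):
    """Follow next-page links from start_url and collect every item.

    fetch_page is a hypothetical callable: url -> (items, next_url or None).
    """
    items = []
    url = start_url
    while url is not None:
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items


def scrape_all(start_urls, fetch_page):
    """Scrape every category, returning one combined list of items."""
    all_items = []
    for start_url in start_urls:
        all_items.extend(scrape_category(start_url, fetch_page))
    return all_items


# Made-up stand-in for the shop: each URL maps to (items, next page or None).
FAKE_SITE = {
    "lager": (["lager-1", "lager-2"], "lager?page=2"),
    "lager?page=2": (["lager-3"], None),
    "stout": (["stout-1"], None),
}

all_items = scrape_all(["lager", "stout"], FAKE_SITE.__getitem__)
print(all_items)  # ['lager-1', 'lager-2', 'lager-3', 'stout-1']
# A single export (for example one call to export_data()) would go here.
```

Because the collected data is returned instead of appended to globals, the export step sees everything exactly once and no longer depends on which page happens to be scraped last.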

Answered By - Dobromil Szczesny Stodulski

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
