Issue
I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm
The table has no id or class attribute and only carries summary and width attributes. Is there any way to scrape it? Perhaps with XPath? I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.
<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
<thead>
<tr>
<th scope="col" data-type="numeric" data-toggle="true"> Date </th>
</tr>
</thead>
<tbody>
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []

for p in range(1, page + 1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
Solution
When scraping tables, it is best practice to use pandas.read_html(), which covers 95% of all cases. Simply iterate over the pages and concatenate the dataframes:
import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

pd.concat(
    [pd.read_html(url + '?page=' + str(i))[0] for i in range(1, 16)],
    ignore_index=True
)
Note that you can also extract the cell links via extract_links='body' (see the sketch after the output table below).
This will result in:
|     | Date | Brand Name | Product Description | Reason/Problem | Company | Details/Photo |
|---|---|---|---|---|---|---|
| 0 | 12/31/2015 | PharMEDium | Norepinephrine Bitartrate added to Sodium Chloride | Discoloration | PharMEDium Services, LLC | nan |
| 1 | 12/31/2015 | Thomas Produce | Cucumbers | Salmonella | Thomas Produce Company | nan |
| 2 | 12/28/2015 | Wegmans, Uoriki Fresh | Octopus Salad | Listeria monocytogenes | Uoriki Fresh, Inc. | nan |
| ... | | | | | | |
| 433 | 01/05/2015 | Whole Foods Market | Assorted cookie platters | Undeclared tree nuts | Whole Foods Market | nan |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more | Walnut Pieces | Salmonella contamination | Eillien's Candies Inc. | nan |
| 435 | 01/02/2015 | Full Tilt Ice Cream | Ice Cream | Listeria monocytogenes | Full Tilt Ice Cream | nan |
| 436 | 01/02/2015 | Zilks | Hummus | Undeclared peanuts | Zilks Foods | nan |
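Regarding the extract_links note above, here is a minimal sketch, assuming pandas >= 1.5 (the version that introduced the extract_links parameter). With extract_links='body' every body cell comes back as a (text, href) tuple, so any detail links survive the parse, while the header row stays plain text because only the body section was requested:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# body cells are returned as (text, href) tuples; href is None where a cell has no link
df = pd.read_html(url + '?page=1', extract_links='body')[0]

# keep only the text part of every cell (Series.str indexing also works on tuples)
texts = df.apply(lambda col: col.str[0])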
Based on your manual approach, simply select the first table, iterate over its rows, and store the information in a list of dicts, which can then be converted into a dataframe:
import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
data = []

for i in range(1, 16):
    soup = BeautifulSoup(requests.get(url + '?page=' + str(i)).text, 'html.parser')
    # first table on the page; only rows that contain <td> cells (skips the header row)
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href')
        })
data
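As stated above, the collected list of dicts converts into a dataframe in one step; a short sketch (the placeholder 'any other' key from the snippet above would of course be replaced by the remaining columns you actually extract):

import pandas as pd

# each dict becomes one row; the dict keys become the column names
df = pd.DataFrame(data)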
Answered By - HedgeHog