Issue
I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm
The table has no id or class attribute and only carries summary and width attributes. Is there any way to scrape it? Perhaps with XPath? I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.
<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
<thead>
<tr>
<th scope="col" data-type="numeric" data-toggle="true"> Date </th>
</tr>
</thead>
<tbody>
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []

for p in range(1, page + 1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
Solution
When scraping tables, it is best practice to use pandas.read_html(), which covers 95% of all cases. Simply iterate over the pages and concatenate the dataframes:
import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

pd.concat(
    [pd.read_html(url + '?page=' + str(i))[0] for i in range(1, 16)],
    ignore_index=True
)
Note that you can also extract the cell links via extract_links='body' (see the sketch after the output table below).
This will result in:
|     | Date | Brand Name | Product Description | Reason/Problem | Company | Details/Photo |
|---|---|---|---|---|---|---|
| 0 | 12/31/2015 | PharMEDium | Norepinephrine Bitartrate added to Sodium Chloride | Discoloration | PharMEDium Services, LLC | nan |
| 1 | 12/31/2015 | Thomas Produce | Cucumbers | Salmonella | Thomas Produce Company | nan |
| 2 | 12/28/2015 | Wegmans, Uoriki Fresh | Octopus Salad | Listeria monocytogenes | Uoriki Fresh, Inc. | nan |
| ... | | | | | | |
| 433 | 01/05/2015 | Whole Foods Market | Assorted cookie platters | Undeclared tree nuts | Whole Foods Market | nan |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more | Walnut Pieces | Salmonella contamination | Eillien's Candies Inc. | nan |
| 435 | 01/02/2015 | Full Tilt Ice Cream | Ice Cream | Listeria monocytogenes | Full Tilt Ice Cream | nan |
| 436 | 01/02/2015 | Zilks | Hummus | Undeclared peanuts | Zilks Foods | nan |
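Regarding the extract_links note above, here is a minimal sketch, assuming pandas >= 1.5 (the version that introduced the extract_links parameter). With extract_links='body' every body cell comes back as a (text, href) tuple, so any detail links survive the parse, while the header row stays plain text because only the body section was requested:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# body cells are returned as (text, href) tuples; href is None where a cell has no link
df = pd.read_html(url + '?page=1', extract_links='body')[0]

# keep only the text part of every cell (Series.str indexing also works on tuples)
texts = df.apply(lambda col: col.str[0])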
Based on your manual approach, simply select the first table, iterate over its rows, and store the information in a list of dicts, which can then be converted into a dataframe:
import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
data = []

for i in range(1, 16):
    soup = BeautifulSoup(requests.get(url + '?page=' + str(i)).text, 'html.parser')
    # first table on the page; only rows that contain <td> cells (skips the header row)
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href')
        })
data
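As stated above, the collected list of dicts converts into a dataframe in one step; a short sketch (the placeholder 'any other' key from the snippet above would of course be replaced by the remaining columns you actually extract):

import pandas as pd

# each dict becomes one row; the dict keys become the column names
df = pd.DataFrame(data)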
Answered By - HedgeHog