Issue
I'm new to bs4 and trying to scrape data from the table with class="ids_table". Here is an HTML example:
<div class="table-wrap">
<table class="ids_table"><tbody>
<tr>
<td class="ids_td"><b>First string</b></td>
<td class="ids_td"><b>Second string</b></td>
<td class="ids_td"><b>Third string</b></td>
<td class="ids_td"><b>Fourth string</b></td>
</tr>
<tr>
<td class="ids_td">d</td>
<td class="ids_td"> <b>LLLM2001</b></td>
<td class="ids_td"> <font color="#00875a"><b>12-July-2022</b></font> </td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">e</td>
<td class="ids_td"> <b>MLLL0056</b></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">f</td>
<td class="ids_td"> <del>AMMK0001</del><br>
<font color="#00875a"><b>MMKA0001</b></font></td>
<td class="ids_td"><font color="#00875a"> <b>12 July 2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">i</td>
<td class="ids_td"> <font color="#00875a"><b>ANJK1111</b></font></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">j</td>
<td class="ids_td"> <font color="#00875a"><b>YMLC3939</b></font></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
</tbody></table>
</div>
I want to:
- Scrape all font values from the table.
- Always scrape "First string"..."Fourth string" from the table header (they are also in td, but always have the same position and values).
- Ignore del in td (not necessary).
- Leave blank only those IDs that are not in font (by IDs I mean LLLM2001, MLLL0056 etc.).
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table", {"class": "ids_table"})
data = [[x for x in v.find_all('td')[1:-1]] for v in table]
data = [[x.text.strip() if x.find('font') else '' for x in c] for c in data]
data:
[['',
'',
'',
'',
'',
'12-July-2022',
'',
'',
'',
'11-June-2022',
'',
'',
'AMMK0001\n \xa0MMKA0001',
'12 July 2022',
'',
'',
'ANJK1111',
'11-June-2022',
'',
'',
'YMLC3939',
'11-June-2022']]
As a result I want to get:
[['First string',
'Second string',
'Third string',
'Fourth string',
'd',
'',
'12-July-2022',
'e',
'',
'11-June-2022',
'f',
'MMKA0001',
'12 July 2022',
'i',
'ANJK1111',
'11-June-2022',
'j',
'YMLC3939',
'11-June-2022']]
Thank you in advance
Solution
I don't really understand your 4th point, nonetheless:
from bs4 import BeautifulSoup
import pandas as pd
html = '''<div class="table-wrap">
<table class="ids_table"><tbody>
<tr>
<td class="ids_td"><b>First string</b></td>
<td class="ids_td"><b>Second string</b></td>
<td class="ids_td"><b>Third string</b></td>
<td class="ids_td"><b>Fourth string</b></td>
</tr>
<tr>
<td class="ids_td">d</td>
<td class="ids_td"> <b>LLLM2001</b></td>
<td class="ids_td"> <font color="#00875a"><b>12-July-2022</b></font> </td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">e</td>
<td class="ids_td"> <b>MLLL0056</b></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">f</td>
<td class="ids_td"> <del>AMMK0001</del><br> <font color="#00875a"><b>MMKA0001</b></font></td>
<td class="ids_td"><font color="#00875a"> <b>12 July 2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">i</td>
<td class="ids_td"> <font color="#00875a"><b>ANJK1111</b></font></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">j</td>
<td class="ids_td"> <font color="#00875a"><b>YMLC3939</b></font></td>
<td class="ids_td"> <font color="#00875a"><b>11-June-2022</b></font></td>
<td class="ids_td"> </td>
</tr>
</tbody></table>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# remove every <del> tag
for to_be_deleted in soup.select('del'):
    to_be_deleted.decompose()
# remove the values from the 'Second string' column which are not wrapped in a <font> tag
# (the first four <b> tags are the headers, so skip them)
tds = soup.select('b')
for x in tds[4:]:
    if x.parent.name == 'td':
        x.decompose()
# this is how you get first row - headers
table_headers = [x.text.strip() for x in soup.select_one("table.ids_table").select('tr')[0].select('td')]
# this is how you get the fonts
fonts = [x.text.strip() for x in soup.select_one("table.ids_table").select('font')]
# this is how you display the data in an intelligible way
df = pd.read_html(str(soup))[0]
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
df
This returns:
   First string  Second string  Third string  Fourth string
1             d            NaN  12-July-2022            NaN
2             e            NaN  11-June-2022            NaN
3             f       MMKA0001  12 July 2022            NaN
4             i       ANJK1111  11-June-2022            NaN
5             j       YMLC3939  11-June-2022            NaN
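If a flat list like the one in the question is needed rather than a DataFrame, a bs4-only approach might look like the sketch below. It leaves the parsed tree untouched and, for each data row, keeps the first cell, then takes the text of the font tag in each middle cell (or an empty string when there is none), which also skips the del text. The function name flatten_ids_table and the shortened html sample are assumptions made for illustration, and the result is a single flat list rather than the nested one shown in the question.

```python
from bs4 import BeautifulSoup

# Shortened sample (assumption: same structure as the table in the question).
html = """
<table class="ids_table"><tbody>
<tr>
<td class="ids_td"><b>First string</b></td>
<td class="ids_td"><b>Second string</b></td>
<td class="ids_td"><b>Third string</b></td>
<td class="ids_td"><b>Fourth string</b></td>
</tr>
<tr>
<td class="ids_td">d</td>
<td class="ids_td"> <b>LLLM2001</b></td>
<td class="ids_td"> <font color="#00875a"><b>12-July-2022</b></font> </td>
<td class="ids_td"> </td>
</tr>
<tr>
<td class="ids_td">f</td>
<td class="ids_td"> <del>AMMK0001</del><br>
<font color="#00875a"><b>MMKA0001</b></font></td>
<td class="ids_td"><font color="#00875a"> <b>12 July 2022</b></font></td>
<td class="ids_td"> </td>
</tr>
</tbody></table>
"""


def flatten_ids_table(markup):
    """Hypothetical helper: return header texts followed by row values."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.select("table.ids_table tr")

    # The first row holds the four header cells; keep all of them.
    flat = [td.get_text(strip=True) for td in rows[0].find_all("td")]

    for row in rows[1:]:
        cells = row.find_all("td")
        # The first cell is the row label (d, e, f, ...).
        flat.append(cells[0].get_text(strip=True))
        # Middle cells: keep only text wrapped in <font>, else leave blank.
        # The last (empty) cell is dropped, as in the desired output.
        for td in cells[1:-1]:
            font = td.find("font")
            flat.append(font.get_text(strip=True) if font else "")
    return flat


result = flatten_ids_table(html)
print(result)
```

Reading the font tag directly means nothing has to be decomposed first, so the same soup object can still be reused for other queries afterwards.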
Answered By - platipus_on_fire