Issue
I'm exploring BeautifulSoup and aiming to retain only specific tags in an HTML file to create a new one.
I can successfully achieve this with the following program. However, I believe there might be a more suitable and natural approach without the need to manually append the strings.
from bs4 import BeautifulSoup

#soup = BeautifulSoup(page.content, 'html.parser')
with open('P:/Test.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
NewHTML = "<html><body>"
NewHTML += "\n" + str(soup.find('title'))
NewHTML += "\n" + str(soup.find('p', attrs={'class': 'm-b-0'}))
NewHTML += "\n" + str(soup.find('div', attrs={'id': 'right-col'}))
NewHTML += "</body></html>"

with open("output1.html", "w") as file:
    file.write(NewHTML)
Solution
You can have a list of desired tags, iterate through them, and use Beautiful Soup's append method to selectively include corresponding elements in the new HTML structure.
from bs4 import BeautifulSoup

with open('Test.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
new_html = BeautifulSoup("<html><body></body></html>", 'html.parser')
tags_to_keep = ['title', {'p': {'class': 'm-b-0'}}, {'div': {'id': 'right-col'}}]

# Iterate through the tags to keep and append them to the new HTML
for tag in tags_to_keep:
    if isinstance(tag, str):
        # A string is a bare tag name: find it in the original HTML
        found = soup.find(tag)
    elif isinstance(tag, dict):
        # A dictionary maps a tag name to its attributes: extract both,
        # then find the matching element in the original HTML
        tag_name = list(tag.keys())[0]
        tag_attrs = tag[tag_name]
        found = soup.find(tag_name, attrs=tag_attrs)
    else:
        continue
    # find() returns None when nothing matches, and append(None) raises an error
    if found is not None:
        new_html.body.append(found)

with open("output1.html", "w") as file:
    file.write(str(new_html))
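One thing worth knowing about the loop above: append moves a tag that already belongs to a tree, so each matched element is removed from soup as it is added to new_html. If the original tree is still needed afterwards, appending a copy avoids that. A minimal sketch, assuming a small inline document (made up for illustration, not the asker's Test.html):

```python
import copy
from bs4 import BeautifulSoup

# Illustrative inline document (an assumption, not the asker's Test.html)
html = (
    "<html><head><title>Test Page</title></head>"
    "<body><p class='m-b-0'>Kept</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
new_html = BeautifulSoup("<html><body></body></html>", "html.parser")

paragraph = soup.find("p", attrs={"class": "m-b-0"})
# Append a copy so the matched tag stays in the original tree
new_html.body.append(copy.copy(paragraph))

title = soup.find("title")
# Appending the tag itself moves it out of the original tree
new_html.body.append(title)

print(soup.find("p", attrs={"class": "m-b-0"}) is not None)  # True
print(soup.find("title") is None)  # True
```

For a one-off extraction script the move is harmless; the copy only matters if soup is queried again later.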
Assuming you have an HTML document like the one below (which would have been helpful to include for reproducibility's sake):
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
<p>Paragraph outside the targeted tags.</p>
</body>
</html>
the resulting output1.html will contain the following content (whitespace aside, since str() does not pretty-print):
<html>
<body>
<title>Test Page</title>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
</body>
</html>
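The result above can also be checked in memory, without reading or writing files. The sketch below is a variation rather than the code from the answer: it inlines the sample document as a string, swaps the string-or-dictionary list for CSS selectors passed to select_one, and prints the result with prettify() to get one tag per line. The helper name keep_selected and the extra span.missing selector are hypothetical, added only to show that unmatched selectors are skipped.

```python
from bs4 import BeautifulSoup

# The sample document from above, inlined so the example needs no files
SAMPLE = """<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
<p>Paragraph outside the targeted tags.</p>
</body>
</html>"""


def keep_selected(html, selectors):
    """Return a new document holding the first match of each CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    new_html = BeautifulSoup("<html><body></body></html>", "html.parser")
    for selector in selectors:
        found = soup.select_one(selector)
        # select_one() returns None when nothing matches, so skip those
        if found is not None:
            new_html.body.append(found)
    return new_html


result = keep_selected(SAMPLE, ["title", "p.m-b-0", "div#right-col", "span.missing"])
print(result.prettify())
```

Selectors keep the list of targets flat and readable, at the cost of depending on Beautiful Soup's CSS selector support; the find-based loop in the answer works the same way without it.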
Answered By - Andreas Violaris