Issue
Is there a way to extract text with BeautifulSoup so that each piece of text is associated only with its most specific (innermost) HTML tag? For example:
<div>
I'm a div
<p>I'm a paragraph</p>
</div>
Is there a way that I end up with
I'm a div
when getting the text from the div tag and I end up with:
I'm a paragraph
when getting the text from the p tag?
I've been working with the code below:
from bs4 import BeautifulSoup

# html_description holds the HTML string to clean up
soup = BeautifulSoup(html_description, 'html.parser')
sanitised_description = ''
TAGS_TO_APPEND = ['div', 'p', 'h1']
for tag in soup.find_all(True):
    if tag.name in TAGS_TO_APPEND:
        sanitised_description += tag.get_text(strip=True) + '\n\n'  # Add two new lines after block tags such as <p>
    elif tag.name == 'li':
        sanitised_description += '\n* ' + tag.get_text(strip=True)  # Add '*' for <li> tags
Because tag.get_text() returns all the text within the tag, including the text of its descendants (i.e. I get "I'm a div I'm a paragraph" when looking at the div tag), I end up with duplicated text. I also can't just take all the text at the top level, because I need to reformat it per tag.
I've looked at multiple threads, one of them being "Show text inside the tags BeautifulSoup", but I don't think the solution provided there applies to my situation.
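The duplication described above can be reproduced with a minimal sketch. The variable names (html_text, texts) are illustrative and not part of the original code; the snippet only shows that get_text() on an outer tag also includes the text of its nested tags.

```python
from bs4 import BeautifulSoup

# Illustrative input, reduced from the example in the question
html_text = "<div>I'm a div<p>I'm a paragraph</p></div>"
soup = BeautifulSoup(html_text, "html.parser")

# get_text() collects the text of the tag and of all its descendants,
# so the paragraph text appears once for the <div> and again for the <p>
texts = [tag.get_text(strip=True) for tag in soup.find_all(["div", "p"])]
for text in texts:
    print(repr(text))
```

The first entry already contains the paragraph text, so concatenating the entries repeats it.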
Solution
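Use string=True and recursive=False in tag.find_all(). string=True selects text nodes, and recursive=False restricts the search to the tag's direct children, so text inside nested tags is skipped: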
from bs4 import BeautifulSoup
html_text = """\
<div>
I'm a div
<p>I'm a paragraph</p>
</div>"""
soup = BeautifulSoup(html_text, "html.parser")
TAGS_TO_APPEND = ["div", "p", "h1"]
for tag in soup.find_all(TAGS_TO_APPEND):
    text = tag.find_all(string=True, recursive=False)
    text = " ".join(t for t in map(str.strip, text) if t)
    print("TAG =", tag.name)
    print("TEXT =", text)
    print()
Prints:
TAG = div
TEXT = I'm a div
TAG = p
TEXT = I'm a paragraph
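As a rough sketch of how this could be combined with the formatting from the question, the snippet below builds sanitised_description from each tag's direct text only. The helper direct_text, the name BLOCK_TAGS, and the sample HTML with a list are assumptions made for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Sample input (assumed for illustration): the question's HTML plus a short list
html_description = """\
<div>
I'm a div
<p>I'm a paragraph</p>
<ul><li>First item</li><li>Second item</li></ul>
</div>"""

BLOCK_TAGS = ["div", "p", "h1"]


def direct_text(tag):
    """Return the text that sits directly inside `tag`, ignoring nested tags."""
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(s for s in map(str.strip, pieces) if s)


soup = BeautifulSoup(html_description, "html.parser")
parts = []
for tag in soup.find_all(BLOCK_TAGS + ["li"]):
    text = direct_text(tag)
    if not text:
        continue  # skip tags that only wrap other tags
    if tag.name == "li":
        parts.append("* " + text + "\n")  # bullet for list items
    else:
        parts.append(text + "\n\n")  # blank line after block-level text

sanitised_description = "".join(parts)
print(sanitised_description)
```

One caveat with this approach: text wrapped in inline tags that are not in the list (for example <b> inside an <li>) is dropped, because only direct text nodes of the listed tags are collected. Such tags would need to be added to the list or handled separately.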
Answered By - Andrej Kesely