Issue
I have HTML text that I am trying to clean up using BeautifulSoup. I need not only to identify which text segments are contained in span elements with class='highlight', but also to preserve the order in which all segments appear in the text.
For example, here is some example code:
from bs4 import BeautifulSoup
import pandas as pd
original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""
# Parse the HTML content
soup = BeautifulSoup(original_string, 'html.parser')
Desired output (in this case there are 4 text segments):
data = {
    'text_order': [0, 1, 2, 3],
    'text': ["Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels",
             "Their large, ", "cheerful blooms",
             "bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry."],
    'highlight': [True, False, True, False]
}
df = pd.DataFrame(data)
print(df)
I've tried to extract the span text using "highlight_spans = soup.find_all('span', class_='highlight')", but this returns only the highlighted spans, so it does not tell me where they sit relative to the rest of the text in the paragraph.
Solution
Try:
import pandas as pd
from bs4 import BeautifulSoup
original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""
# Parse the HTML content
soup = BeautifulSoup(original_string, "html.parser")
data = []
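# find_all(string=True) yields every text node inside <p> in document order
for i, text in enumerate(soup.p.find_all(string=True)):
    data.append(
        {
            "text_order": i,
            "text": text.strip(),
            # True when the text node has an ancestor with class "highlight"
            "highlight": bool(text.find_parent(class_="highlight")),
        }
    )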
df = pd.DataFrame(data)
print(df)
Prints:
text_order text highlight
0 0 Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels True
1 1 . Their large, False
2 2 cheerful blooms True
3 3 bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry. False
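Note that segment 1 above still starts with the ". " left over from the end of the first sentence, whereas the desired output in the question starts at "Their large,". Here is a minimal sketch of one way to close that gap, assuming it is acceptable to drop leading periods and to skip nodes that end up empty. The helper name split_highlights and its highlight_class parameter are hypothetical, chosen for illustration; only find_all(string=True) and find_parent come from BeautifulSoup.

```python
from bs4 import BeautifulSoup


def split_highlights(html, highlight_class="highlight"):
    """Return the text nodes of the first <p> in order, flagged by highlight status."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.p.find_all(string=True):
        # Assumption: a leading "." is leftover punctuation and can be dropped.
        text = node.strip().lstrip(".").strip()
        if not text:
            continue  # skip whitespace-only or punctuation-only nodes
        rows.append(
            {
                "text_order": len(rows),
                "text": text,
                # True when any ancestor of the text node carries the class
                "highlight": node.find_parent(class_=highlight_class) is not None,
            }
        )
    return rows


html = (
    '<p><span class="highlight">First part</span>. Middle, '
    '<span class="highlight">second part</span> and the rest.</p>'
)
rows = split_highlights(html)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) exactly as in the answer. Because text_order is taken from len(rows), the numbering stays consecutive even when a node is skipped.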
Answered By - Andrej Kesely