Friday, December 8, 2023

[FIXED] Identify text with span elements using Python BeautifulSoup

Labels: beautifulsoup, html, pandas, python

Issue

I have HTML text that I am trying to clean up using BeautifulSoup. I need to identify which text segments are contained in span elements with class='highlight', while also preserving the order in which all segments appear in the text.

For example, here is some sample code:

from bs4 import BeautifulSoup
import pandas as pd

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, 'html.parser')

Desired output (in this case there are 4 text segments):

data = {
    'text_order': [0, 1, 2, 3],
    'text': ["Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels",
             "Their large, ", "cheerful blooms", 
             "bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry."],
    'highlight': [True, False, True, False]
}

df = pd.DataFrame(data)
print(df)

I've tried extracting the span text with "highlight_spans = soup.find_all('span', class_='highlight')", but this returns only the highlighted spans, so their position relative to the rest of the paragraph text is lost.
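To make the limitation concrete, here is a minimal sketch using a made-up, shortened HTML string (not the original one). It shows that find_all on the span class returns the highlighted spans only; the unhighlighted text between them is absent from the result, so the full sequence of segments cannot be rebuilt from it.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the original HTML string.
html = '<p><span class="highlight">A</span>. B, <span class="highlight">C</span>D.</p>'
soup = BeautifulSoup(html, "html.parser")

# Only the <span class="highlight"> elements are matched.
highlight_spans = soup.find_all("span", class_="highlight")
span_texts = [span.get_text() for span in highlight_spans]

# The segments ". B, " and "D." do not appear in the result.
print(span_texts)
```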

Solution

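You can iterate over every text node in the paragraph with find_all(string=True) and check whether each one has a parent with the class "highlight":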

import pandas as pd
from bs4 import BeautifulSoup

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, "html.parser")

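data = []
# find_all(string=True) returns every text node inside <p> in document order
for i, text in enumerate(soup.p.find_all(string=True)):
    data.append(
        {
            "text_order": i,
            "text": text.strip(),
            # True when the text node has an ancestor with class "highlight"
            "highlight": bool(text.find_parent(class_="highlight")),
        }
    )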

df = pd.DataFrame(data)
print(df)

Prints:

   text_order                                                                                                                                                                                                                                                                                                text  highlight
0           0                                                                                                                                                                                                                Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels       True
1           1                                                                                                                                                                                                                                                                                      . Their large,      False
2           2                                                                                                                                                                                                                                                                                     cheerful blooms       True
3           3  bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.      False
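Note that row 1 of the printed output keeps the leading ". " that the desired output in the question leaves out. The sketch below does not try to remove it. It only wraps the same approach in a reusable function and skips text nodes that are empty after stripping, which can occur when two spans are separated by whitespace alone. The names split_by_highlight, highlight_class and sample are hypothetical and chosen for illustration.

```python
from bs4 import BeautifulSoup


def split_by_highlight(html, highlight_class="highlight"):
    """Return the text segments of the first <p>, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all(string=True) yields every text node in document order.
    for node in soup.p.find_all(string=True):
        text = node.strip()
        if not text:
            # Skip nodes that contain only whitespace.
            continue
        rows.append(
            {
                "text_order": len(rows),
                "text": text,
                # True when some ancestor carries the highlight class.
                "highlight": node.find_parent(class_=highlight_class) is not None,
            }
        )
    return rows


# Shortened, hypothetical stand-in for the original HTML string.
sample = '<p><span class="highlight">One</span>. Two, <span class="highlight">three</span> four.</p>'
rows = split_by_highlight(sample)
for row in rows:
    print(row)
```

The returned list of dictionaries can be passed to pd.DataFrame in the same way as the data list above.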

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
