Friday, January 14, 2022

[FIXED] Find text between specific id beautifulsoup

Issue

I have HTML like the following example:

<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
    <a class="anchor-entry" id="cat1-first-id"></a>
    <div class="col-lg-10">
        <h3>First H3 Title</h3>
    </div>
    <a class="anchor-entry" id="cat1-second-id"></a>
    <div class="col-lg-10">
        <h3>Second H3 Title</h3>
    </div>
    <div class="row">
        <a class="anchor-entry" id="cat1-third-id"></a>
        <div class="col-lg-10">
            <h3>Third H3 Title</h3>
        </div>
    </div>
</div>

<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
    <a class="anchor-entry" id="cat2-first-id"></a>
    <div class="col-lg-10">
        <h3>First H3 Title</h3>
    </div>
    <div class="row">
        <a class="anchor-entry" id="cat2-second-id"></a>
        <div class="col-lg-10">
            <h3>Second H3 Title</h3>
        </div>
    </div>
</div>

<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
    <a class="anchor-entry" id="cat3-first-id"></a>
    <div class="col-lg-10">
        <h3>Cat-3 First H3 Title</h3>
    </div>
    <div class="row">
        <a class="anchor-entry" id="cat3-second-id"></a>
        <div class="col-lg-10">
            <h3>Cat-3 Second H3 Title</h3>
        </div>
        <div class="row">
            <a class="anchor-entry" id="cat3-third-id"></a>
            <div class="col-lg-10">
                <h3>Cat-3 Third H3 Title</h3>
            </div>
        </div>
    </div>
</div>

So the blocks are not each wrapped in their own div; instead, each block sits between a tags with a specific id.

I have the list of every id I need (category-1, category-2), and I would like to collect all the h3 text for each category in a Python object (dict, DataFrame, whatever):

d = {
    'category-1': ['First H3 Title', 'Second H3 Title', 'Third H3 Title'],
    'category-2': ['First H3 Title', 'Second H3 Title']
}

The problem is that I can't find a way to get the information in between:

import requests
from bs4 import BeautifulSoup

url = 'myUrl'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

category_list = ['category-1', 'category-2']


for i in category_list:
    
    # list like: [<a class="anchor" id="category-1"></a>]
    catid = soup.find_all(id=i)

    # long list like: [<a class="anchor-entry" id="cat1-first-id"></a>, ...]
    cata = soup.find_all('a', {'class': 'anchor-entry'})

But catid and cata aren't linked to each other, and this is where I got stuck.

Solution

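Your code selects the category anchor and the a tags with class anchor-entry, but it never visits the h3 elements or works out where one category ends and the next begins.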

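# Continues from the question's code, where soup has already been built.
category_list = ['category-1', 'category-2', 'category-3']
# Every category marker; reaching one of these ends the current category.
category_tags = soup.find_all("a", {"class": "anchor"})
d = {}

for i in category_list:
    # Start at the first element after this category's anchor.
    tag = soup.find("a", {"id": i}).find_next()
    # Walk forward in document order until the next category anchor or the end.
    while tag is not None and tag not in category_tags:
        if tag.name == "h3":
            d.setdefault(i, []).append(tag.text)
        tag = tag.find_next()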

My approach is to traverse the HTML tree in document order, starting from each category anchor, and store the text of every h3 header in d until the next category anchor is found.
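
The traversal described above can be tried without a network request. The following is a minimal sketch, assuming a trimmed copy of the question's HTML held in a string. The names SAMPLE_HTML and collect_h3_by_category are made up for illustration and are not part of BeautifulSoup. It also swaps the repeated find_next() calls for a single find_all_next(True) loop, which should visit the same elements in the same order.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (hypothetical sample data).
SAMPLE_HTML = """
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
    <a class="anchor-entry" id="cat1-first-id"></a>
    <div class="col-lg-10"><h3>First H3 Title</h3></div>
    <div class="row">
        <a class="anchor-entry" id="cat1-second-id"></a>
        <div class="col-lg-10"><h3>Second H3 Title</h3></div>
    </div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
    <a class="anchor-entry" id="cat2-first-id"></a>
    <div class="col-lg-10"><h3>Cat-2 H3 Title</h3></div>
</div>
"""


def collect_h3_by_category(html, category_ids):
    """Map each category id to the h3 texts found before the next category anchor."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for category_id in category_ids:
        start = soup.find("a", id=category_id)
        if start is None:
            continue  # this id does not appear in the page
        titles = []
        # find_all_next(True) returns every tag after start, in document order.
        for tag in start.find_all_next(True):
            # An <a class="anchor"> marks the start of the next category.
            if tag.name == "a" and "anchor" in tag.get("class", []):
                break
            if tag.name == "h3":
                titles.append(tag.get_text(strip=True))
        result[category_id] = titles
    return result


d = collect_h3_by_category(SAMPLE_HTML, ["category-1", "category-2"])
print(d)
# {'category-1': ['First H3 Title', 'Second H3 Title'], 'category-2': ['Cat-2 H3 Title']}
```

Unlike the loop above, this sketch stores an empty list for a category that contains no h3 and silently skips ids that are missing from the page; whether that is the right behavior depends on how the result is used.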

Answered By - phyominh

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
