Issue
I have HTML like the following example:
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
<a class="anchor-entry" id="cat1-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<a class="anchor-entry" id="cat1-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat1-third-id"></a>
<div class="col-lg-10">
<h3>Third H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
<a class="anchor-entry" id="cat2-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat2-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
<a class="anchor-entry" id="cat3-first-id"></a>
<div class="col-lg-10">
<h3>Cat-3 First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-second-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-third-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Third H3 Title</h3>
</div>
</div>
</div>
</div>
So the blocks are not grouped inside a per-category div; each group simply sits between two a tags (class anchor) with specific ids.
I have the list of every id I need (category-1, category-2), and I would like to collect, in a Python object (dict, DataFrame, whatever), all the h3 text for each category:
d = {
    'category-1': ['Cat-1 First H3 Title', 'Cat-1 Second H3 Title', 'Cat-1 Third H3 Title'],
    'category-2': ['Cat-2 First H3 Title', 'Cat-2 Second H3 Title']
}
The problem is that I couldn't find a way to select only the elements that lie between two anchors:
import requests
from bs4 import BeautifulSoup

url = 'myUrl'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

category_list = ['category-1', 'category-2']
for i in category_list:
    # list like: [<a class="anchor" id="category-1"></a>]
    catid = soup.find_all(id=i)
    # long list like: [<a class="anchor-entry" id="cat1-first-id"></a>, ...]
    cata = soup.find_all('a', {'class': 'anchor-entry'})
But catid and cata aren't linked to each other, and this is where I got stuck.
Solution
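Your cata query only selects the a tags with class anchor-entry, across the whole document, so it carries no information about which category each one belongs to.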
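category_list = ['category-1', 'category-2', 'category-3']
# Every category anchor; reaching one of these ends the current category.
category_tags = soup.find_all("a", {"class": "anchor"})
d = {}
for i in category_list:
    d[i] = []
    # Start from the first element after this category's anchor.
    tag = soup.find("a", {"id": i}).find_next()
    # Walk forward in document order until the next category anchor
    # or the end of the document (find_next() returns None).
    while tag is not None and tag not in category_tags:
        if tag.name == "h3":
            d[i].append(tag.text)
        tag = tag.find_next()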
My approach is to walk the HTML tree in document order starting from each category anchor, collect the text of every h3 along the way, and store it in d until the next category anchor is found.
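The same idea can also be sketched as a single pass over the document, which avoids restarting the walk for every category. This is only an illustrative variant, not the answerer's code: the helper name h3_by_category and the trimmed SAMPLE_HTML are made up for the example, and it assumes that category anchors always carry the class anchor and that every h3 belongs to the most recent such anchor.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (assumed sample data).
SAMPLE_HTML = """
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat1-first-id"></a>
  <div class="col-lg-10"><h3>First H3 Title</h3></div>
  <a class="anchor-entry" id="cat1-second-id"></a>
  <div class="col-lg-10"><h3>Second H3 Title</h3></div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat2-first-id"></a>
  <div class="col-lg-10"><h3>First H3 Title</h3></div>
</div>
<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat3-first-id"></a>
  <div class="col-lg-10"><h3>Cat-3 First H3 Title</h3></div>
</div>
"""


def h3_by_category(soup, wanted_ids):
    """Group h3 texts under the most recent a.anchor seen in document order.

    Hypothetical helper for illustration; wanted_ids is the list of
    category ids to keep, and every other category is ignored.
    """
    result = {cat_id: [] for cat_id in wanted_ids}
    current = None  # id of the category anchor we are currently "inside"
    # find_all returns matches in document order, so one pass is enough.
    for tag in soup.find_all(["a", "h3"]):
        if tag.name == "a":
            # Only a.anchor starts a new category; a.anchor-entry is skipped.
            if "anchor" in tag.get("class", []):
                current = tag.get("id")
        elif current in result:
            result[current].append(tag.get_text(strip=True))
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
d = h3_by_category(soup, ["category-1", "category-2"])
print(d)
```

On this trimmed sample, d maps category-1 to ['First H3 Title', 'Second H3 Title'] and category-2 to ['First H3 Title']. Categories that are requested but have no h3 (or do not exist in the page) end up with an empty list rather than a missing key.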
Answered By - phyominh