Issue
I have HTML structured like the example below:
<div class="container">
  <div class="row">
    <div class="otherattr">
      <div id="listalbum">
        <div id="9067" class="album">album: <b>"Name of the album"</b> (2001)</div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div id="91453" class="album">album: <b>"other Name of album"</b> (2007) </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div id="56739" class="album">album: <b>"another album"</b> (2012) </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
        <div class="listalbum_item"> </div>
      </div>
    </div>
  </div>
</div>
My goal is to extract the tags with id = some number (the number is different for every tag), and also the tags with class listalbum_item. For simplicity, assume those tags contain some text or a link; it doesn't matter which.
As you can see, this HTML shows the title of each album followed by all the songs in that album. I want to build a structure (say, a dictionary) like:
dix = {'album_1' : ['song1','song2','song3','song4'] , 'album_3' : ['song1','song2','song3','song4']}
How can I do this? The problem for me is that the id is a number that changes every time. This is just an example: I need to parse a very large site with many artists (and therefore many albums and songs), and I am having trouble organizing the data in an orderly way. I was only able to create one list with all the song tags, but I need to group the songs by the album they belong to.
Thanks a lot!
Solution
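You need to first identify each album, then collect the song divs that follow it with find_next_siblings(). Because that call also returns the songs of every later album, verify for each song with find_previous_sibling() that the nearest preceding album div has the same id attribute as the current album.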
Code:
from bs4 import BeautifulSoup

data = '''<div class="container">
  <div class="row">
    <div class="otherattr">
      <div id="listalbum">
        <div id="9067" class="album">album: <b>"Name of the album"</b> (2001)</div>
        <div class="listalbum_item">song1</div>
        <div class="listalbum_item">song2</div>
        <div class="listalbum_item">song3</div>
        <div id="91453" class="album">album: <b>"other Name of album"</b> (2007) </div>
        <div class="listalbum_item">song1</div>
        <div class="listalbum_item">song4</div>
        <div class="listalbum_item">song2</div>
        <div class="listalbum_item">song3</div>
        <div id="56739" class="album">album: <b>"another album"</b> (2012) </div>
        <div class="listalbum_item">song5</div>
        <div class="listalbum_item">song1</div>
        <div class="listalbum_item">song3</div>
        <div class="listalbum_item">song2</div>
        <div class="listalbum_item">song4</div>
      </div>
    </div>
  </div>
</div>'''

album = {}
soup = BeautifulSoup(data, "html.parser")
# Each direct child of #listalbum with class "album" starts a new album.
for item in soup.select("#listalbum > .album"):
    # The album title is in the <b> tag inside the album div.
    name = item.find_next('b').text
    songs = []
    # find_next_siblings() returns every later song div, including those of later albums.
    for song in item.find_next_siblings('div', class_="listalbum_item"):
        # Keep the song only if the nearest preceding album div is the current one.
        if song.find_previous_sibling('div', class_='album')['id'] == item['id']:
            songs.append(song.text)
    album[name] = songs
print(album)
Output:
{'"Name of the album"': ['song1', 'song2', 'song3'], '"another album"': ['song5', 'song1', 'song3', 'song2', 'song4'], '"other Name of album"': ['song1', 'song4', 'song2', 'song3']}
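As a possible alternative (not part of the original answer), the same grouping can be sketched as a single pass over the direct children of #listalbum, remembering the most recent album and appending each song to it. This avoids re-scanning the later siblings for every album, which may matter on a large page. The function name group_songs_by_album and the shortened sample markup are made up for illustration, and the sketch assumes every album div contains a <b> tag with the title.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (hypothetical), same shape as the page in the question.
html = """
<div id="listalbum">
  <div id="9067" class="album">album: <b>"Name of the album"</b> (2001)</div>
  <div class="listalbum_item">song1</div>
  <div class="listalbum_item">song2</div>
  <div id="91453" class="album">album: <b>"other Name of album"</b> (2007)</div>
  <div class="listalbum_item">song3</div>
</div>
"""


def group_songs_by_album(markup):
    """Map each album title to the list of songs that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    albums = {}
    current = None  # title of the album the next songs belong to
    # Walk the direct <div> children of #listalbum in document order.
    for div in soup.select("#listalbum > div"):
        classes = div.get("class", [])
        if "album" in classes:
            # Assumption: the title is in the album's <b> tag; drop the quotes.
            current = div.b.get_text(strip=True).strip('"')
            albums[current] = []
        elif "listalbum_item" in classes and current is not None:
            albums[current].append(div.get_text(strip=True))
    return albums


albums = group_songs_by_album(html)
print(albums)
```

If two albums can share a title, keying the dictionary on div["id"] instead of the title would keep them apart, since the question says the id is different for every album.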
Answered By - KunduK