Friday, November 17, 2023

[FIXED] Parsing web page in python using Beautiful Soup

November 17, 2023 beautifulsoup, python, urllib No comments

Issue

I have some troubles with getting the data from the website. The website source is here:

view-source:http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO

there's sth like this:

INFORMACJE O FILMIE
Tytuł............................................: La mer à boire
Ocena.............................................: IMDB - 6.3/10 (24)
Produkcja.........................................: Francja
Gatunek...........................................: Dramat
Czas trwania......................................: 98 min.
Premiera..........................................: 22.02.2012 - Świat
Reżyseria........................................: Jacques Maillot
Scenariusz........................................: Pierre Chosson, Jacques Maillot
Aktorzy...........................................: Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel

And I want to get the data from this website to have a Python list of strings:

[[Tytuł, "La mer à boire"]
[Ocena, "IMDB - 6.3/10 (24)"]
[Produkcja, Francja]
[Gatunek, Dramat]
[Czas trwania, 98 min.]
[Premiera, "22.02.2012 - Świat"]
[Reżyseria, "Jacques Maillot"]
[Scenariusz, "Pierre Chosson, Jacques Maillot"]
[Aktorzy, "Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel"]]

I wrote some code using BeautifulSoup but I cant go any further, I just don't know what to get the rest from the website source and how to convert is to string ... Please, help!

My code:

    # -*- coding: utf-8 -*-
#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

try :
    web_page = urllib2.urlopen("http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO").read()
    soup = BeautifulSoup(web_page)
    c = soup.find('span', {'class':'vi'}).contents
    print(c)
except urllib2.HTTPError :
    print("HTTPERROR!")
except urllib2.URLError :
    print("URLERROR!")

Solution

The secret of using BeautifulSoup is to find the hidden patterns of your HTML document. For example, your loop

for ul in soup.findAll('p') :
    print(ul)

is in the right direction, but it will return all paragraphs, not only the ones you are looking for. The paragraphs you are looking for, however, have the helpful property of having a class i. Inside these paragraphs one can find two spans, one with the class i and another with the class vi. We are lucky because those spans contains the data you are looking for:

<p class="i">
    <span class="i">Tytuł............................................</span>
    <span class="vi">: La mer à boire</span>
</p>

So, first get all the paragraphs with the given class:

>>> ps = soup.findAll('p', {'class': 'i'})
>>> ps
[<p class="i"><span class="i">Tytuł... <LOTS OF STUFF> ...pan></p>]

Now, using list comprehensions, we can generate a list of pairs, where each pair contains the first and the second span from the paragraph:

>>> spans = [(p.find('span', {'class': 'i'}), p.find('span', {'class': 'vi'})) for p in ps]
>>> spans
[(<span class="i">Tyt... ...</span>, <span class="vi">: La mer à boire</span>), 
 (<span class="i">Ocena... ...</span>, <span class="vi">: IMDB - 6.3/10 (24)</span>),
 (<span class="i">Produkcja.. ...</span>, <span class="vi">: Francja</span>),
 # and so on
]

Now that we have the spans, we can get the texts from them:

>>> texts = [(span_i.text, span_vi.text) for span_i, span_vi in spans]
>>> texts
[(u'Tytu\u0142............................................', u': La mer \xe0 boire'),
 (u'Ocena.............................................', u': IMDB - 6.3/10 (24)'),
 (u'Produkcja.........................................', u': Francja'), 
  # and so on
]

Those texts are not ok still, but it is easy to correct them. To remove the dots from the first one, we can use rstrip():

>>> u'Produkcja.........................................'.rstrip('.')
u'Produkcja'

The : string can be removed with lstrip():

>>> u': Francja'.lstrip(': ')
u'Francja'

To apply it to all content, we just need another list comprehension:

>>> result = [(text_i.rstrip('.'), text_vi.replace(': ', '')) for text_i, text_vi in texts]
>>> result
[(u'Tytu\u0142', u'La mer \xe0 boire'),
 (u'Ocena', u'IMDB - 6.3/10 (24)'),
 (u'Produkcja', u'Francja'),
 (u'Gatunek', u'Dramat'),
 (u'Czas trwania', u'98 min.'),
 (u'Premiera', u'22.02.2012 - \u015awiat'),
 (u'Re\u017cyseria', u'Jacques Maillot'),
 (u'Scenariusz', u'Pierre Chosson, Jacques Maillot'),
 (u'Aktorzy', u'Daniel Auteuil, Maud Wyler, Yann Tr&eacute;gou&euml;t, Alain Beigel'),
 (u'Wi\u0119cej na', u':'),
 (u'Trailer', u':Obejrzyj zwiastun')]

And that is it. I hope this step-by-step example can make the use of BeautifulSoup clearer for you.

Answered By - brandizzi

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, November 17, 2023

[FIXED] Parsing web page in python using Beautiful Soup

Issue

INFORMACJE O FILMIE

Solution

0 comments:

Post a Comment

Popular Posts

Labels