Saturday, January 13, 2024

[FIXED] How to select two td and output as single line with bs4?

Labels: beautifulsoup, python-3.x, web-scraping

Issue

I want to scrape some data, but I am having a hard time selecting the two td cells of each row and printing them together on a single line.

Sample of the HTML:

<tr>
<td class ='verseNumCell'>
፳
</td>
<td class ='verseConentCell'>
ወትቤሎን ኢትስምያኒ ኖሔሚን ስምያኒ መራር እስመ መረርኩ ፈድፋደ ወብዙኀ ።
</td>
</tr>
<tr>
<td class ='verseNumCell'>
፳፩
</td>
<td class ='verseConentCell'>
አንሰ ምልእትየ ሖርኩ ወዕራቅየ አግብአኒ <span class='divineWord'>እግዚአብሔር</span> ለምንት ትብላኒ ኖሔሚን እንዘ <span class='divineWord'>እግዚአብሔር</span> አኅሰረኒ ወፈድፋደ አሕመመኒ ።
</td>
</tr>
<tr>

What I did:

import os
import re

import bs4
import requests

url = "https://www.ethiopicbible.com/books/%E1%8A%A6%E1%88%AA%E1%89%B5-%E1%8B%98%E1%8D%8D%E1%8C%A5%E1%88%A8%E1%89%B5-1"
parameters = {}
response = requests.get(url, params=parameters)
soup = bs4.BeautifulSoup(response.text, "html.parser")
element_list = soup.find("div", class_="geezBibleChapterContainer").find_all("td")

for element in element_list:
    text = element.get_text()
    text = os.linesep.join([s for s in text.splitlines() if s])
    if not re.match(r'^\s*$', text):
        print(text)

My output:

፳
ወትቤሎን ኢትስምያኒ ኖሔሚን ስምያኒ መራር እስመ መረርኩ ፈድፋደ ወብዙኀ ።
፳፩
አንሰ ምልእትየ ሖርኩ ወዕራቅየ አግብአኒ እግዚአብሔር ለምንት ትብላኒ ኖሔሚን እንዘ እግዚአብሔር አኅሰረኒ ወፈድፋደ አሕመመኒ ።

What I am trying to get:

፳ ወትቤሎን ኢትስምያኒ ኖሔሚን ስምያኒ መራር እስመ መረርኩ ፈድፋደ ወብዙኀ ።
፳፩ አንሰ ምልእትየ ሖርኩ ወዕራቅየ አግብአኒ እግዚአብሔር ለምንት ትብላኒ ኖሔሚን እንዘ እግዚአብሔር አኅሰረኒ ወፈድፋደ አሕመመኒ ።

Should I select the td elements in separate "soups"?

Solution

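Instead of selecting the individual cells, select each row and call get_text(separator=' ', strip=True) on it. This strips the surrounding whitespace from every text fragment inside the row and joins the fragments with a single space:

for row in soup.select('div.geezBibleChapterContainer tr'):
    print(row.get_text(separator=' ', strip=True))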

This produces:

፩ በቀዳሚ ገብረ እግዚአብሔር ሰማየ ወምድረ ።
፪ ወምድርሰ ኢታስተርኢ ወኢኮነት ድሉተ ወጽልመት መልዕልተ ቀላይ ወመንፈሰ እግዚአብሔር ይጼልል መልዕልተ ማይ ።
፫ ወይቤ እግዚአብሔር ለይኩን ብርሃን ወኮነ ብርሃን ።
፬ ወርእዮ እግዚአብሔር ለብርሃን ከመ ሠናይ ወፈለጠ እግዚአብሔር ማእከለ ብርሃን ወማእከለ ጽልመት ።
፭ ወሰመዮ እግዚአብሔር ለብርሃን ዕለተ ወለጽልመት ሌሊተ ወኮነ ሌሊተ ወጸብሐ ወኮነ መዓልተ ፩ ።
፮ ወይቤ እግዚአብሔር ለይኩን ጠፈር ማእከለ ማይ ከመ ይፍልጥ ማእከለ ማይ ወኮነ ከማሁ ።
፯ ወገብረ እግዚአብሔር ጠፈረ ወፈለጠ እግዚአብሔር ማእከለ ማይ ዘታሕተ ጠፈር ወማእከለ ማይ ዘመልዕልተ ጠፈር ።
፰ ወሰመዮ እግዚአብሔር ለውእቱ ጠፈር ሰማየ ወርእየ እግዚአብሔር ከመ ሠናይ ወኮነ ሌሊተ ወጸብሐ ወኮነ ካልእተ ዕለተ ።
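
To see why the row-level call is enough, the two rows from the question can be parsed offline. This is only a minimal sketch: the wrapping div and table are an assumption added so that the selector from the answer matches, and the cell markup is copied from the sample HTML above.

```python
import bs4

# Two rows copied from the question's sample HTML. The surrounding
# <div> and <table> are assumed here so the answer's selector matches.
html = """
<div class="geezBibleChapterContainer"><table>
<tr>
<td class='verseNumCell'>
፳
</td>
<td class='verseConentCell'>
ወትቤሎን ኢትስምያኒ ኖሔሚን ስምያኒ መራር እስመ መረርኩ ፈድፋደ ወብዙኀ ።
</td>
</tr>
<tr>
<td class='verseNumCell'>
፳፩
</td>
<td class='verseConentCell'>
አንሰ ምልእትየ ሖርኩ ወዕራቅየ አግብአኒ <span class='divineWord'>እግዚአብሔር</span> ለምንት ትብላኒ ኖሔሚን እንዘ <span class='divineWord'>እግዚአብሔር</span> አኅሰረኒ ወፈድፋደ አሕመመኒ ።
</td>
</tr>
</table></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment (and drops whitespace-only ones);
# separator=" " then joins the remaining fragments with one space.
lines = [
    row.get_text(separator=" ", strip=True)
    for row in soup.select("div.geezBibleChapterContainer tr")
]

for line in lines:
    print(line)
```

Each row prints as one line: the verse number, a space, then the verse text. The text inside the nested span elements is joined in the same way, so no separate handling of the two td cells is needed.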

Example

import requests
import bs4

url = "https://www.ethiopicbible.com/books/%E1%8A%A6%E1%88%AA%E1%89%B5-%E1%8B%98%E1%8D%8D%E1%8C%A5%E1%88%A8%E1%89%B5-1"
parameters = {}
response = requests.get(url, params=parameters)
soup = bs4.BeautifulSoup(response.text, "html.parser")

# One line per table row: verse number and verse text joined by a space
for row in soup.select('div.geezBibleChapterContainer tr'):
    print(row.get_text(separator=' ', strip=True))
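
If the table might also contain rows without verse cells (for example a heading row), a possible variation is to pick the two cells by class and skip rows where either is missing. The following is a sketch under assumptions, not part of the original answer: the helper name verse_lines and the ASCII sample data are hypothetical, while the class names (including the spelling verseConentCell) are taken from the question's HTML.

```python
import bs4


def verse_lines(html: str) -> list:
    """Return one 'number text' string per row that has both verse cells.

    Hypothetical helper for illustration; class names follow the question.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    lines = []
    for row in soup.select("tr"):
        num = row.select_one("td.verseNumCell")
        content = row.select_one("td.verseConentCell")
        if num is None or content is None:
            # Not a verse row (e.g. a heading row): skip it
            continue
        lines.append(
            f"{num.get_text(strip=True)} {content.get_text(' ', strip=True)}"
        )
    return lines


# Placeholder ASCII data standing in for the real page content
sample = """
<table>
<tr><th>heading row without verse cells</th></tr>
<tr><td class='verseNumCell'> 1 </td>
    <td class='verseConentCell'> first <span class='divineWord'>word</span> here </td></tr>
<tr><td class='verseNumCell'> 2 </td>
    <td class='verseConentCell'> second verse </td></tr>
</table>
"""

print(verse_lines(sample))
```

This keeps everything in a single soup; the two cells are simply looked up per row, which answers the question of whether separate "soups" are required.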

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
