Issue
I have beautifulsoup4 (4.9.0) installed and am trying to parse some HTML with Python 3.7.
I'm gathering data from some tables whose cells are split by <br> line breaks, e.g.:
<td>some text<br>some more text</td>
However, .get_text() seems to ignore the line breaks and prints everything on one line:
from bs4 import BeautifulSoup
html = '<td>some text<br>some more text</td>'
soup = BeautifulSoup(html, features='html.parser')
print(soup)
>> <td>some text<br/>some more text</td>
print(soup.get_text())
>> some textsome more text
The <br> is converted to a <br/>, but I don't know much HTML, so I'm not sure if that's significant.
Desired outcome
A list of the strings that are between each line break. I was thinking of using the .get_text() method and then calling .split() on the resulting string with the line break character, e.g.:
html = '<td>some text<br>some more text</td>'
soup = BeautifulSoup(html, features='html.parser')
strings = soup.get_text().split('?')
>> ['some text', 'some more text']
Does anyone know how to get get_text() to recognise the line breaks, and what the ? would need to be? I was thinking of replacing the line breaks with an unambiguous character/string that won't be ignored, and splitting on that. More elegant solutions would be appreciated though!
Thanks
Solution
My solution, as described in the question: replace each <br> tag with an unambiguous string, then split the text on that string:
from bs4 import BeautifulSoup
html = '<td>some text<br>some more text</td>'
soup = BeautifulSoup(html, features='html.parser')
delimiter = '###'  # unambiguous string that must not occur in the cell text
for line_break in soup.find_all('br'):  # loop through line break tags
    line_break.replace_with(delimiter)  # replace each br tag with the delimiter
strings = soup.get_text().split(delimiter)  # get list of strings
print(strings)
>> ['some text', 'some more text']
Answered By - Chris Browne
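Since the question asks for more elegant options, here is a possible alternative that is not part of the answer above. It is a minimal sketch that relies on two documented Beautiful Soup features: the .stripped_strings generator and the separator argument of get_text(). The variable names strings_from_nodes and strings_from_separator are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<td>some text<br>some more text</td>'
soup = BeautifulSoup(html, features='html.parser')

# Option 1: collect every text node, stripped of surrounding whitespace.
strings_from_nodes = list(soup.stripped_strings)

# Option 2: ask get_text() to join the text nodes with a separator,
# then split on that same separator.
strings_from_separator = soup.get_text(separator='\n').split('\n')

print(strings_from_nodes)      # ['some text', 'some more text']
print(strings_from_separator)  # ['some text', 'some more text']
```

Both options split wherever one text node ends and another begins, not only at <br> tags, so a cell containing other inline tags such as <b> would be split there too. If only <br> should act as a boundary, the delimiter replacement shown in the answer is the safer choice.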