Issue
Environment:
- Python 3.9.4
- beautifulsoup4==4.12.2
Code:
from bs4 import BeautifulSoup
test_content = '''<html><head></head><body><p>123</p><p>123<br>123</p></body></html>'''
bs = BeautifulSoup(test_content, 'html.parser')
Why does bs.find_all('p') return all elements, while bs.find_all('p', string=True) only returns elements without <br> in them?
>>> bs.find_all('p')
[<p>123</p>, <p>123<br/>123</p>]
>>> bs.find_all('p', string=True)
[<p>123</p>]
>>> import re
>>> bs.find_all('p', string=re.compile('.+'))
[<p>123</p>]
I've searched through the BeautifulSoup docs but found nothing related.
My question is: why does adding string=True make find_all skip elements that contain <br> tags?
And how can I find all elements (with or without <br> tags)? Omitting the string argument doesn't help here, because my actual need is to find elements containing certain keywords, e.g. string=re.compile('KEYWORD').
Solution
Passing string=True makes find_all check the .string attribute of each element.
If we print that attribute for every element, we find that only one element has a .string value:
for element in bs.find_all(True):
    print(element, element.string)
<html><head></head><body><p>123</p><p>123<br/>123</p></body></html> None
<head></head> None
<body><p>123</p><p>123<br/>123</p></body> None
<p>123</p> 123
<p>123<br/>123</p> None
<br/> None
In the case of <p>123<br/>123</p>, the element's .string attribute is None. This is because it actually has 3 children: the text '123', the <br/> tag, and the text '123' again.
>>> p = bs.find_all('p')[1]
>>> print(p.string)
None
>>> print(list(p.children))
['123', <br/>, '123']
In other words, .string is only set when the element has exactly one child: either a single string, or a single tag that itself has a .string. An element that mixes text with child tags has no .string.
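To make the rule concrete, here is a small sketch comparing .string and the number of children for three paragraphs. The markup and the variable names (plain, mixed, nested) are made up for illustration; the third paragraph shows the single-child-tag case described in the Beautiful Soup docs, where .string is taken from the only child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text only, text mixed with <br>, and a single nested tag.
html = "<div><p>123</p><p>123<br>123</p><p><b>123</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

plain, mixed, nested = soup.find_all("p")

# .string is None as soon as the element has more than one child.
string_values = [plain.string, mixed.string, nested.string]
child_counts = [len(list(p.children)) for p in (plain, mixed, nested)]

print(string_values)  # ['123', None, '123']
print(child_counts)   # [1, 3, 1]
```

Only the middle paragraph, the one with three children, is invisible to string=True.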
If you want to find all elements that have any text including their children, you can do the following:
def has_text(tag):
    return bool(tag.text)

bs.find_all(has_text)
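Note that a function filter like has_text also matches ancestors such as <html> and <body>, since their text includes that of their descendants. For the keyword search mentioned in the question, one possible approach is to combine a tag-name check with a regex over get_text(). This is only a sketch: the markup, the pattern, and the helper name p_with_keyword are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; 'KEYWORD' stands in for whatever you are searching for.
html = (
    "<html><body>"
    "<p>KEYWORD</p><p>some KEYWORD<br>more</p><p>other</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("KEYWORD")

def p_with_keyword(tag):
    # Restrict to <p> so that <html> and <body> are not returned as well,
    # then search the combined text of the element and its descendants.
    return tag.name == "p" and pattern.search(tag.get_text()) is not None

matches = soup.find_all(p_with_keyword)
print(matches)  # [<p>KEYWORD</p>, <p>some KEYWORD<br/>more</p>]
```

Because get_text() joins the text on both sides of a <br> without a separator, a keyword split across a <br> would also match; pass a separator to get_text() if that is not what you want.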
Answered By - sytech