Issue
from bs4 import BeautifulSoup
from lxml import etree
import requests
import re
URL = "https://csimarket.com/stocks/at_glance.php?code=AA"
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
raw_html = soup.find('a', href="../Industry/Industry_Data.php?s=100")
print(raw_html)
I am getting:
<span class="oran2">•</span>Basic Materials
I just want the text "Basic Materials". How do I do that?
I am doing:
raw_html = soup.find('a', href="../Industry/Industry_Data.php?s=100")
but I would like to match on ../Industry/Industry_Data.php only, without the query string. Thanks.
Solution
When you do
raw_html = soup.find('a', href="../Industry/Industry_Data.php?s=100")
you get the whole tag back, not just its text. That tag contains the text you want and also a span holding a bullet point character.
To get just the text ("Basic Materials"), remove the span from the element by calling .decompose() on it (this works for any element you want to remove).
After that, use the .text attribute to get the inner text of the a tag.
PS: .text can include surrounding whitespace, so calling .strip() on it is recommended.
code:
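a_tag = raw_html  # the <a> tag returned by soup.find(...) above
span = a_tag.find("span")
if span is not None:
    span.decompose()  # remove the bullet span from the tree
print(a_tag.text.strip())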
output:
Basic Materials
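For the second part of the question (matching on the href without the ?s=100 query string), one option is to pass a compiled regular expression as the href filter, which BeautifulSoup applies with search(). The following is only a minimal sketch that runs offline: SAMPLE_HTML is assumed markup modelled on the output shown in the question, and link_text and href_prefix are hypothetical names introduced for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup, modelled on the output shown in the question;
# the real page may differ.
SAMPLE_HTML = """
<div>
  <a href="../Industry/Industry_Data.php?s=100"><span class="oran2">•</span>Basic Materials</a>
</div>
"""


def link_text(html, href_prefix="../Industry/Industry_Data.php"):
    """Hypothetical helper: return the link text without the bullet span,
    or None if no <a> has an href starting with href_prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled pattern used as an attribute filter is matched with search(),
    # so "^" anchors it to the start of the href value.
    a_tag = soup.find("a", href=re.compile("^" + re.escape(href_prefix)))
    if a_tag is None:
        return None
    # Remove every span inside the link before reading its text.
    for span in a_tag.find_all("span"):
        span.decompose()
    return a_tag.get_text(strip=True)


print(link_text(SAMPLE_HTML))  # Basic Materials
```

Note that decompose() modifies the parsed tree in place, so the span is gone from soup afterwards. Checking for None first avoids an AttributeError when the page layout changes and no matching link is found.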
Answered By - Rohit Patil