Issue
I am trying to extract the phone number, address, and email from a couple of corporate websites through web scraping.
My code is as follows:
import re
import requests
from bs4 import BeautifulSoup
l = 'https://www.zimmermanfinancialgroup.com/about'
address_t = []
phone_num_t = []
# make a request to the link
response = requests.get(l)
soup = BeautifulSoup(response.content, "html.parser")
phone_regex = "(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"
# extract the phone number information
match = soup.findAll(string=re.compile(phone_regex))
if match:
    print("Found the matching string:", match)
else:
    print("Matching string not found")
# extract email address information
mail = "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b"
match_a = soup.findAll(string=re.compile(mail))
match_a
The above code works fine for the phone number and extracts it correctly, but it fails to detect the email address. I see the same issue with another website (https://www.benefitexperts.com/about-us/).
Solution
The email address you are looking for is not in the page text. It sits in the href attribute of an <a> tag (if the tag exists), as a string that starts with 'mailto:'. A search with string=... only looks at text nodes, so it never sees it. Instead, pass href as a keyword argument to find_all (findAll is its older alias) so that it matches every tag whose href attribute matches the regular expression.
Read more about keyword arguments in the official BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#the-keyword-arguments
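Or simply (note the r prefix: in the question's non-raw string, "\b" is a backspace character rather than a word boundary, so that pattern cannot match an ordinary address):
mail = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"
match_a = soup.findAll(href=re.compile(mail))
Then clean up each result to extract just the mail address. Do not use strip('mailto:') for this: strip removes any of those characters from both ends of the string rather than the prefix, so it can also eat the first and last letters of the address.
match_a = [a['href'].replace('mailto:', '', 1) for a in match_a]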
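The cleanup step can also be sketched on its own, without any network access. This is a minimal illustration rather than a definitive implementation: the helper name address_from_href and the example.com hrefs are hypothetical, and it assumes the input values are href strings such as those read from a['href'].

```python
import re
from urllib.parse import unquote, urlsplit

# Raw string, so \b reaches the regex engine as a word boundary.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")


def address_from_href(href):
    """Return the address in a mailto: link, or None if there is none.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(href.strip())
    if parts.scheme != "mailto":
        return None
    # parts.path is what follows "mailto:", without any ?subject=... query.
    match = EMAIL.search(unquote(parts.path))
    return match.group(0) if match else None


# Made-up hrefs standing in for values scraped from a page.
hrefs = [
    "mailto:info@example.com",
    "mailto:sales@example.com?subject=Hello",
    "/contact",
]
print([address_from_href(h) for h in hrefs])
# ['info@example.com', 'sales@example.com', None]

# The pitfall with strip: it treats its argument as a set of characters.
print("mailto:info@example.com".strip("mailto:"))
# nfo@example.c
```

Parsing the href as a URL also drops any ?subject=... part that some mailto links carry, which a plain prefix replacement would leave in place.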
Answered By - Ktifler