Issue
I'm extracting the audio for a bunch of words from Oxford Learner's Dictionaries using BeautifulSoup
in Python. Here's the code:
#!/bin/python3
import sys
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}
response = requests.get(sys.argv[1], headers=headers)
print("HTTP Response Status Code:", response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
print(list(soup.find(class_="phonetics")))
I run the program using the following command:
./english_audio.py "https://www.oxfordlearnersdictionaries.com/definition/english/hello_1?q=hello"
Solution
Why does its length become 4 when I convert it to a list?
soup.find() returns exactly what you selected: a bs4.element.Tag, an object corresponding to the single <span> with class phonetics, which is the parent of the other elements nested in it.
A bs4.element.Tag is iterable, and iterating over it yields its direct children:
for e in soup.find(class_="phonetics"):
    print(e.name)
This prints:
None
div
None
div
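This is exactly what the built-in list() function does: it converts any iterable object, such as a string or tuple, to a list. Here the result has four elements: the two entries whose name is None are text nodes (the whitespace between the tags), and the other two are the <div> tags.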
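To see the same behaviour without hitting the site, here is a minimal sketch on a made-up HTML snippet. The markup and the class names on the inner <div> elements are assumptions for illustration, not the real page; only the shape (a <span class="phonetics"> holding two <div> children separated by newlines) mirrors the output above.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: only the outer class "phonetics" comes from the question.
html = """<span class="phonetics">
<div class="british">BrE</div>
<div class="american">NAmE</div></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find(class_="phonetics")

# list() simply iterates over the tag, i.e. over its direct children.
children = list(span)
kinds = [type(child).__name__ for child in children]
print(len(children), kinds)

# Keep only real tags and drop the whitespace text nodes.
tags_only = [child for child in span if isinstance(child, Tag)]
print([t.name for t in tags_only])
```

Filtering with isinstance(child, Tag) is one way to skip the text nodes; span.find_all(recursive=False) should give the same two tags.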
You have to select more specifically to get only the first audio link in your <span>. I used CSS selectors here to simplify chaining of selectors:
soup.select_one('.phonetics [data-src-mp3]').get('data-src-mp3')
or, to get a list of all audio links for hello:
[e.get('data-src-mp3') for e in soup.select('.phonetics [data-src-mp3]')]
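The two selectors can be tried offline as in the rough sketch below. The HTML, the example.com URLs and the "sound" class are invented for illustration; only the phonetics class and the data-src-mp3 attribute come from the code above. Note that select_one() returns None when nothing matches, so a guard is worth adding before calling .get().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the URLs and the "sound" class are made up.
html = """
<span class="phonetics">
  <div><div class="sound" data-src-mp3="https://example.com/hello_gb.mp3"></div></div>
  <div><div class="sound" data-src-mp3="https://example.com/hello_us.mp3"></div></div>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# First element inside .phonetics that carries a data-src-mp3 attribute.
first = soup.select_one(".phonetics [data-src-mp3]")
first_url = first.get("data-src-mp3") if first is not None else None
print(first_url)

# All matching elements, in document order.
all_urls = [e.get("data-src-mp3") for e in soup.select(".phonetics [data-src-mp3]")]
print(all_urls)

# A selector that matches nothing yields None rather than raising.
missing = soup.select_one(".phonetics [data-no-such-attribute]")
print(missing)
```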
Answered By - HedgeHog