Issue
I normally use the googlesearch library as follows:
from googlesearch import search
list(search(f"{query}", num_results))
But I now keep getting this error:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%{query}%26num%3D10%26hl%3Den%26start%3D67&hl=en&q=EhAmABcACfAIIME0fDvEUYF8GOKX1KQGIjAEGg2nloeEEAcko9umYCP9uPHRWoSo2odE3n3ZgbQ1L6lDvGfyai6798pyy3iU5vcyAXJaAUM
I developed a "hacky" solution using requests
and BeautifulSoup
, but it's very inefficient and takes me 1 hour to get 100 URLs, when the line above would take 1 second:
import random

import requests
from bs4 import BeautifulSoup

search_results = []
retry = True
while retry:
    try:
        response = requests.get(
            f"https://www.google.com/search?q={query}",
            headers={
                'User-Agent': user_agent,
                'Referer': 'https://www.google.com/',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'en-US,en;q=0.9,en-gb',
            },
            proxies={
                "http": proxy,
                "https": proxy,
            },
            timeout=TIMEOUT_THRESHOLD * 2)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Result links sit inside the .yuRUbf containers in Google's HTML.
            for link in soup.select('.yuRUbf a'):
                url = link['href']
                search_results.append(url)
                if len(search_results) >= num_results:
                    retry = False
                    break
        else:
            # Blocked or rate-limited: rotate to a new proxy and user agent.
            proxy = get_working_proxy(proxies)
            user_agent = random.choice(user_agents)
    except Exception as e:
        proxy = get_working_proxy(proxies)
        user_agent = random.choice(user_agents)
        print(f"An error occurred in tips search: {str(e)}")
Is there a better, easier way to still use my proxies to get a list of Google search results for a query?
Solution
Your code is most likely slow because every request goes through a proxy, which adds latency, and possibly because your timeout threshold is too high, so you spend a long time waiting for dead requests to give up.
Things you can try:
- If you are using free proxies, buy some paid ones; they tend to have better availability and speed.
- Reduce the timeout threshold (I noticed you doubled it; try not doing that and see how it goes). A sketch of both tweaks follows this list.
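Here is a minimal sketch of what that could look like, reusing the .yuRUbf selector and the proxy/user-agent rotation from your question. The proxy URL and user_agents values are placeholders you would swap for your own, and the timeout and pause numbers are just starting points, not recommendations:

import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholders -- substitute your own proxy URL and user-agent pool.
proxy = "http://user:pass@host:port"
user_agents = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."]

def fetch_results(query, num_results, timeout=10, pause=2.0):
    """Collect result URLs, one results page (10 links) at a time."""
    urls = []
    start = 0
    while len(urls) < num_results:
        response = requests.get(
            "https://www.google.com/search",
            params={"q": query, "num": 10, "hl": "en", "start": start},
            headers={"User-Agent": random.choice(user_agents)},
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,      # one modest timeout, not TIMEOUT_THRESHOLD * 2
        )
        response.raise_for_status()   # fail fast on 429 instead of looping
        soup = BeautifulSoup(response.text, "html.parser")
        page_urls = [a["href"] for a in soup.select(".yuRUbf a")]
        if not page_urls:             # no more results (or the layout changed)
            break
        urls.extend(page_urls)
        start += 10                   # move to the next results page
        time.sleep(pause)             # be polite between requests
    return urls[:num_results]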
The problem is that your question should probably be "how can I scrape Google Search faster?". The reason you were getting those 429 errors is that you were hammering Google's servers with requests, and I would guess scraping is against Google's terms of service. So the real answer is:
- Use the Google Search API (see the sketch below), or do it slowly and have patience.
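For the API route, here is a minimal sketch using the Custom Search JSON API (Programmable Search Engine). It assumes you have created an API key and a search engine ID; both are placeholders below. The API returns at most 10 results per request, so you page through with the start parameter:

import requests

API_KEY = "YOUR_API_KEY"          # placeholder -- create one in the Google Cloud console
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder -- from your Programmable Search Engine

def api_search(query, num_results=30):
    """Fetch result URLs via the Custom Search JSON API, 10 per request."""
    urls = []
    start = 1                                      # the API uses 1-based start indices
    while len(urls) < num_results:
        response = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query,
                    "num": 10, "start": start},
            timeout=10,
        )
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:                              # no more results
            break
        urls.extend(item["link"] for item in items)
        start += 10
    return urls[:num_results]

The trade-off is a limited free quota, but you get structured JSON back instead of parsing Google's HTML, and no proxies are needed.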
Disclaimer: Scraping websites that ask you not to do it is a morally dubious act, and not something I endorse.
Answered By - Mark