Issue
I normally use the googlesearch library as follows:
from googlesearch import search
list(search(f"{query}", num_results))
But I now keep getting this error:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%{query}%26num%3D10%26hl%3Den%26start%3D67&hl=en&q=EhAmABcACfAIIME0fDvEUYF8GOKX1KQGIjAEGg2nloeEEAcko9umYCP9uPHRWoSo2odE3n3ZgbQ1L6lDvGfyai6798pyy3iU5vcyAXJaAUM
I developed a "hacky" solution using requests
and BeautifulSoup
, but it's very inefficient and takes me 1 hour to get 100 URLs, when the line above would take 1 second:
import random

import requests
from bs4 import BeautifulSoup

search_results = []
retry = True
while retry:
    try:
        response = requests.get(
            f"https://www.google.com/search?q={query}",
            headers={
                'User-Agent': user_agent,
                'Referer': 'https://www.google.com/',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'en-US,en;q=0.9,en-gb',
            },
            proxies={
                "http": proxy,
                "https": proxy,
            },
            timeout=TIMEOUT_THRESHOLD * 2)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Result links sit inside the .yuRUbf containers in Google's HTML.
            for link in soup.select('.yuRUbf a'):
                url = link['href']
                search_results.append(url)
                if len(search_results) >= num_results:
                    retry = False
                    break
        else:
            # Blocked or rate-limited: rotate to a new proxy and user agent.
            proxy = get_working_proxy(proxies)
            user_agent = random.choice(user_agents)
    except Exception as e:
        proxy = get_working_proxy(proxies)
        user_agent = random.choice(user_agents)
        print(f"An error occurred in tips search: {str(e)}")
Is there a better, easier way to still use my proxies to get a list of Google search results for a query?
Solution
Your code is most likely slow because every request goes through a proxy, which adds latency, and possibly because your timeout threshold is too high, so you spend a long time waiting for dead requests to give up.
Things you can try:
- If you are using free proxies, buy some paid ones; they tend to have better availability and speed.
- Reduce the timeout threshold (I noticed you doubled it; try not doing that and see how it goes). A sketch of both tweaks follows this list.
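Here is a minimal sketch of what that could look like, reusing the .yuRUbf selector and the proxy/user-agent rotation from your question. The proxy URL and user_agents values are placeholders you would swap for your own, and the timeout and pause numbers are just starting points, not recommendations:

import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholders -- substitute your own proxy URL and user-agent pool.
proxy = "http://user:pass@host:port"
user_agents = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."]

def fetch_results(query, num_results, timeout=10, pause=2.0):
    """Collect result URLs, one results page (10 links) at a time."""
    urls = []
    start = 0
    while len(urls) < num_results:
        response = requests.get(
            "https://www.google.com/search",
            params={"q": query, "num": 10, "hl": "en", "start": start},
            headers={"User-Agent": random.choice(user_agents)},
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,      # one modest timeout, not TIMEOUT_THRESHOLD * 2
        )
        response.raise_for_status()   # fail fast on 429 instead of looping
        soup = BeautifulSoup(response.text, "html.parser")
        page_urls = [a["href"] for a in soup.select(".yuRUbf a")]
        if not page_urls:             # no more results (or the layout changed)
            break
        urls.extend(page_urls)
        start += 10                   # move to the next results page
        time.sleep(pause)             # be polite between requests
    return urls[:num_results]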
The problem is that your question should probably be "how can I scrape Google Search faster?". The reason you were getting those 429 errors is that you were hammering Google's servers with requests, and I would guess scraping is against Google's terms of service. So the real answer is:
- Use the Google Search API (see the sketch below), or do it slowly and have patience.
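For the API route, here is a minimal sketch using the Custom Search JSON API (Programmable Search Engine). It assumes you have created an API key and a search engine ID; both are placeholders below. The API returns at most 10 results per request, so you page through with the start parameter:

import requests

API_KEY = "YOUR_API_KEY"          # placeholder -- create one in the Google Cloud console
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder -- from your Programmable Search Engine

def api_search(query, num_results=30):
    """Fetch result URLs via the Custom Search JSON API, 10 per request."""
    urls = []
    start = 1                                      # the API uses 1-based start indices
    while len(urls) < num_results:
        response = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query,
                    "num": 10, "start": start},
            timeout=10,
        )
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:                              # no more results
            break
        urls.extend(item["link"] for item in items)
        start += 10
    return urls[:num_results]

The trade-off is a limited free quota, but you get structured JSON back instead of parsing Google's HTML, and no proxies are needed.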
Disclaimer: Scraping websites that ask you not to do it is a morally dubious act, and not something I endorse.
Answered By - Mark