Issue
This code works fine a few days ago when I run it:
from bs4 import BeautifulSoup
import datetime
import requests
def getWeekMostRead(date):
nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
content = "amazon"+date.isoformat()+"_nonfiction.html"
with open(content, "w", encoding="utf-8") as nf_file:
print(nonfiction_page.content, file=nf_file)
mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")
nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")
mostread = []
for books in nonfiction:
if books.find(class_="kc-rank-card-publisher") is None:
mostread.append((
books.find(class_="kc-rank-card-title").string.strip(),
books.find(class_="kc-rank-card-author").string.strip(),
"",
books.find(class_="numeric-star-data").small.string.strip()
))
else:
mostread.append((
books.find(class_="kc-rank-card-title").string.strip(),
books.find(class_="kc-rank-card-author").string.strip(),
books.find(class_="kc-rank-card-publisher").string.strip(),
books.find(class_="numeric-star-data").small.string.strip()
))
return mostread
mostread = []
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
print("Scraped data from "+date.isoformat())
mostread.extend(getWeekMostRead(date))
date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
with open("AmazonCharts.csv", "w") as csv:
counter = 0
print("ID,Title,Author,Publisher,Rating", file=csv)
for book in mostread:
counter += 1
print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv)
csv.close()
For some reason, today I tried to run it again and I got this included in the returned HTML file:
To discuss automated access to Amazon data please contact [email protected].\r\n\r\nFor information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
I understand that Amazon is a heavy anti-scraping data (or at least I read so from some replies and threads). I tried to use headers and delays in the code but it does not work. Would there be another way to try this? Or if I should wait, how long should I wait?
Solution
After a while, I have figured the solution. It is rather simply - there was no "2020-01-01" on Amazon, instead, I fixed it "2020-01-05".
Answered By - DiMario
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.