Issue
Moving to the next page while web scraping and changing the format for date
url_list is a list of URLs; one of them would be http://www.moneycontrol.com/company-article/cadilahealthcare/news/CHC#CHC. I found out that there is an href link for moving to different years and different pages, but I cannot seem to use it. Here's the code that extracts links from page 1. I want to do it for all the years and pages available.
Also, when I extract the date from the HTML it is in the format [Last Updated : Feb 07, 2019 03:05 PM IST | Source: Moneycontrol.com]. I want the date in mm/dd/yy format; how would I go about doing that?
for urls in url_list:
    html = requests.get(urls)
    soup = BeautifulSoup(html.text,'html.parser') # Create a BeautifulSoup object
    # Retrieve a list of all the links and the titles for the respective links
    #word1,word2,word3 = "US","USA","USFDA"
    sub_links = soup.find_all('a', class_='arial11_summ')
    for links in sub_links:
        sp = BeautifulSoup(str(links),'html.parser') # first convert into a string
        tag = sp.a
        #if word1 in tag['title'] or word2 in tag['title'] or word3 in tag['title']:
        category_links = Base_url + tag["href"]
        List_of_links.append(category_links)
    time.sleep(3)
What I want to do is scrape the first page, then move to the next page and so on; after scraping the available pages for a particular year, the code should move on to the next year. Kindly explain how I would go about doing this.
Solution
Move to the next page:
Add the year as a query parameter to the URL, like this: https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018
The list of available years can be read from the first page.
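A minimal sketch of the URL-building step, using only the query parameters visible in the URL above (sc_id, durationType, Year). The helper names build_year_url and extract_years are made up for illustration, and extract_years assumes the year links on the first page carry a Year=YYYY query string, which should be checked against the real HTML:

```python
import re
from urllib.parse import urlencode

NEWS_URL = "https://www.moneycontrol.com/stocks/company_info/stock_news.php"


def build_year_url(sc_id, year):
    """Return the news-listing URL for one company code and one year."""
    params = {"sc_id": sc_id, "durationType": "Y", "Year": year}
    return NEWS_URL + "?" + urlencode(params)


def extract_years(html):
    """Collect distinct years from 'Year=YYYY' query strings (assumed link format)."""
    return sorted({int(y) for y in re.findall(r"Year=(\d{4})", html)}, reverse=True)


# Stand-in for the first page's HTML; the real markup may differ.
sample_html = (
    '<a href="stock_news.php?sc_id=CHC&durationType=Y&Year=2018">2018</a> '
    '<a href="stock_news.php?sc_id=CHC&durationType=Y&Year=2017">2017</a>'
)
year_urls = [build_year_url("CHC", y) for y in extract_years(sample_html)]
```

Each entry of year_urls can then go through the same requests.get and BeautifulSoup loop used in the question.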
Extract the date: take the substring that holds only the date and time, then parse the time and the timezone like this.
I updated the code to set the timezone using pytz.
from datetime import datetime
import pytz

raw = 'Feb 07, 2019 03:05 PM IST'  # renamed from input to avoid shadowing the builtin
str_time = raw[:-4]       # 'Feb 07, 2019 03:05 PM'
str_timezone = raw[-3:]   # 'IST'
datetime_object = datetime.strptime(str_time, '%b %d, %Y %I:%M %p')
if str_timezone == 'IST':
    # based on https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
    # Moneycontrol is an Indian site, so IST is taken as India Standard Time
    tz = pytz.timezone('Asia/Kolkata')
else:
    tz = pytz.timezone('UTC')
output = tz.localize(datetime_object)
# test
print(output.strftime('%X %x %z'))
print(output.strftime('%m/%d/%y'))  # explicit mm/dd/yy, as asked in the question
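To go straight from the bracketed text in the question to mm/dd/yy, something like the following might work. It is only a sketch: the helper name to_mmddyy and the regular expression are assumptions based on the single sample string in the question, and the timezone is ignored because only the calendar date is wanted.

```python
import re
from datetime import datetime


def to_mmddyy(raw):
    """Pull the timestamp out of the 'Last Updated' text and return it as mm/dd/yy."""
    # Matches e.g. 'Feb 07, 2019 03:05 PM'; pattern inferred from one sample only.
    match = re.search(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2} [AP]M", raw)
    if match is None:
        return None  # text did not contain a recognisable timestamp
    parsed = datetime.strptime(match.group(0), "%b %d, %Y %I:%M %p")
    return parsed.strftime("%m/%d/%y")


raw = "[Last Updated : Feb 07, 2019 03:05 PM IST | Source: Moneycontrol.com]"
formatted = to_mmddyy(raw)
```

Note that %b matches English month abbreviations only under the default (C/English) locale.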
Answered By - Trung NT Nguyen