Issue
My goal is to scrape multiple profile links and then scrape specific data on each of these profiles.
Here is my code to get multiple profile links (it should work fine):
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re
session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
profiles = soup.find_all(href=re.compile("/profile/kaid"))
for links in profiles:
links_no_list = links.extract()
text_link = links_no_list['href']
text_link_nodiscussion = text_link[:-10]
final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
print(final_profile_link)
Now here is my code to get the specific data on just one profile (it should work fine too):
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
import re
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
if user_info_table is not None:
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
dates=points=videos='NA'
user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
category = gettext.find('span')
category_text = category.text.strip()
number = category.previousSibling.strip()
data[category_text] = number
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
if header_value not in data.keys():
data[header_value]='NA'
user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
if user_calendar is not None:
#for getdate in user_calendar:
last_activity = user_calendar.find('span',class_='streak-cell filled')
last_activity_date = last_activity['title']
#print(last_activity)
#print(last_activity_date)
else:
last_activity_date='NA'
filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()
My question is : how can I automate my scripts? In other words: How can I merge these two scripts?
The goal is to create a sort of variable that is going to be a different profile link every time.
And then for each profile link to get the specific data and then put it into the csv file (a new row for each profile).
Solution
It is fairly very straight forward to do this. I instead of printing the profile links store them to a list variable. Then loop through the list variable to scrape each link and then write to the csv file. Some pages do not have all the details so you have to handle those exceptions as well. In the code below I have marked them also as 'NA', following the convention used in your code. One other note for future is to consider using the python's inbuilt csv module for reading and writing csv files.
Merged Script
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re
session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
links_no_list = links.extract()
text_link = links_no_list['href']
text_link_nodiscussion = text_link[:-10]
final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
profile_list.append(final_profile_link)
filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
for link in profile_list:
print("Scraping ",link)
session = HTMLSession()
r = session.get(link)
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
if user_info_table is not None:
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
dates=points=videos='NA'
user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
category = gettext.find('span')
category_text = category.text.strip()
number = category.previousSibling.strip()
data[category_text] = number
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
if header_value not in data.keys():
data[header_value]='NA'
user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
if user_calendar is not None:
last_activity = user_calendar.find('span',class_='streak-cell filled')
try:
last_activity_date = last_activity['title']
except TypeError:
last_activity_date='NA'
else:
last_activity_date='NA'
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()
Sample Output from khanscrapetry1.csv
date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
...
Answered By - Bitto Bennichan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.