Issue
I am trying to simplify my financial data collection with the code below, but it has a couple of issues. I want to scrape the following page for a specific href: 'https://www.witan.com/investor-information/factsheets/#currentPage=1'
The href I am trying to parse: href="/media/1767/witan-investment-trust_factsheet_310821.pdf"
Currently I am using Selenium, but it is a bit slow, so if it is possible to scrape the page with BS4 I am open to suggestions. My attempts so far have failed.
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Set options for selenium
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--window-size=1920,1200")
# Requests website using Selenium & ChromeDriver
driver = webdriver.Chrome('C:/AnaConda/chromedriver.exe', options=options)
driver.get('https://www.witan.com/investor-information/factsheets/#currentPage=1')
html = driver.page_source
driver.quit()  # close the browser once the page source has been captured
# Parse the rendered HTML and take the first link matching the pattern
soup = BeautifulSoup(html, "html.parser")
link_finder = soup.findAll('a', href=re.compile('/witan-investment-trust-factsheet'))[0]
When I run the above code I get the June factsheet instead: a class="ico-arrow document-view size" href="/media/1750/witan-investment-trust-factsheet-30jun2021.pdf" target="_blank"...
Hope someone can help me!
Solution
The HTML fragment with the PDF links is loaded asynchronously via JavaScript, so BeautifulSoup
doesn't see the links in the initial page source. To print all PDF links, you can do:
import requests
from bs4 import BeautifulSoup

api_url = "https://www.witan.com/umbraco/surface/listing/DocumentListing"
params = {
    "currentPage": "1",
    "year": "2021",
    "isArchive": "false",
    "pagination": "true",
}

with requests.session() as s:
    # load cookies:
    s.get("https://www.witan.com/investor-information/factsheets/")

    # get document page:
    soup = BeautifulSoup(s.get(api_url, params=params).content, "html.parser")
    for a in soup.select(".document-view"):
        print("https://www.witan.com" + a["href"])
Prints:
https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf
https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf
https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf
https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf
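The list above also hints at why the pattern in the question skipped the August file: that file name uses underscores ("_factsheet_"), while the pattern expects a hyphen ("-factsheet"). Below is a minimal sketch for picking one link out of such a list, assuming the factsheet names always contain "factsheet" followed by "-" or "_". The names `FACTSHEET_RE` and `first_factsheet` are hypothetical helpers and not part of the original answer; the hrefs are copied from the output above.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.witan.com"

# Assumption: a factsheet name contains "factsheet" followed by "-" or "_".
FACTSHEET_RE = re.compile(r"/media/\d+/[^/]*factsheet[-_][^/]*\.pdf$", re.IGNORECASE)


def first_factsheet(hrefs, base_url=BASE_URL):
    """Return the absolute URL of the first matching href, or None."""
    for href in hrefs:
        if FACTSHEET_RE.search(href):
            return urljoin(base_url, href)
    return None


# Two of the hrefs from the printed output, in page order.
hrefs = [
    "/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
]

# The pattern from the question only matches the hyphenated name.
old_pattern = re.compile("/witan-investment-trust-factsheet")
print([bool(old_pattern.search(h)) for h in hrefs])

# The tolerant pattern matches both spellings, so the first href is returned.
print(first_factsheet(hrefs))
```

In practice the `hrefs` list would come from `[a["href"] for a in soup.select(".document-view")]` in the code above. This relies on the page listing the newest factsheet first, which matches the output shown here but is not guaranteed.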
Answered By - Andrej Kesely