Issue
I am using kaggle browser. Looking to see if all the below can be done on this kaggle notebook.
Website url: click here
Website screenshot:
The downloading files here in the website are updated every hour and daily. I don't think any information on this website going to change except the xlsx
file content as you see in the website.
I want to download two things from this url: meta information and xlsx
files you see in the screenshot.
First, I want to download this meta information and make it a dataframe as given below. Now I am manually selecting them, copying them here. But I want to do it from the url
url_meta_df =
ID Type Name URL
CAL Region California https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx
CAR Region Carolinas https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx
CENT Region Central https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx
FLA Region Florida https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx
Second: download each xlsx
file, save them.
My code: I have tried following based on an answer here in SO
from bs4 import BeautifulSoup
import requests
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
print(link.get('href'))
Present output:
None
https://twitter.com/EIAgov
None
https://www.facebook.com/eiagov
None
#page-sub-nav
/
#
/petroleum/
/petroleum/weekly/
/petroleum/supply/weekly/
/naturalgas/
http://ir.eia.gov/ngs/ngs.html
/naturalgas/weekly/
/electricity/
/electricity/monthly/
....
Solution
This should get you all the files.
NOTE: This might take a while as the files are big - 20MB+ each. The total size is over 3GB
The meta info you need is in the get_data()
. It's a list of dictionaries.
import os
import random
import time
from pathlib import Path
from shutil import copyfileobj
from typing import List, Dict
import pandas as pd
import requests
from tabulate import tabulate
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}
API_URL = "https://www.eia.gov/electricity/930-api//respondents/data"
FILE_BASE_URL = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"
DOWNLOAD_DIR = Path("xls_files")
PARAMS = {
"type[0]": "BA",
"type[1]": "BR",
"type[2]": "ER",
}
def wait_a_bit() -> None:
pseudo_random_wait = random.randint(1, 5)
print(f"Waiting {pseudo_random_wait} seconds...")
time.sleep(pseudo_random_wait)
def build_xls_links(data: List[dict]) -> Dict[str, str]:
links = {}
for item in data:
if item["type"] == "BA" and item["time_zone"] is not None:
links[item["id"]] = f"{FILE_BASE_URL}/{item['id']}.xlsx"
elif item["type"] == "ER":
links[item["id"]] = f"{FILE_BASE_URL}/Region_{item['id']}.xlsx"
return links
def build_metadata_table(data: List[dict], links: Dict[str, str]) -> List[List[str]]:
metadata_table = []
for item in data:
metadata_table.append(
[
item["id"],
item["name"],
item["type"] if item["type"] in ["BA", "ER"] else "Unknown",
links.get(item["id"], "N/A")
]
)
return metadata_table
def get_data(connection: requests.Session) -> List[dict]:
return connection.get(API_URL, params=PARAMS).json()[0]["data"]
def create_and_save_table(data: List[Dict], xls_links: Dict[str, str]) -> None:
# Do some funky stuff to get the data into a nice table
metadata_table = build_metadata_table(data, xls_links)
df = pd.DataFrame(metadata_table, columns=["ID", "Name", "Type", "Link"])
df.drop(df[df["Link"] == "N/A"].index, inplace=True)
df.drop(df[df["Type"] == "Unknown"].index, inplace=True)
df["Type"] = df["Type"].map({"ER": "Region", "BA": "BA"})
df.sort_values(
by=["Type", "ID"],
inplace=True,
ascending=[False, True],
)
df.to_csv(DOWNLOAD_DIR / "metadata.csv", index=False)
print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))
def fetch_files(connection: requests.Session, xls_links: Dict[str, str]) -> None:
total_files = len(xls_links)
for count, file_link in enumerate(xls_links.values(), start=1):
file_name = file_link.split("/")[-1]
print(f"Downloading file {count} of {total_files}: {file_name}")
file_response = connection.get(file_link, stream=True)
with open(DOWNLOAD_DIR / file_name, "wb") as f:
copyfileobj(file_response.raw, f)
wait_a_bit()
def downloader() -> None:
with requests.Session() as connection:
connection.headers.update(headers)
os.makedirs(DOWNLOAD_DIR, exist_ok=True)
data = get_data(connection)
xls_links = build_xls_links(data)
create_and_save_table(data, xls_links)
print("Would you like to download the files? (y/n)")
user_input = input(">>> ")
if user_input.lower() != "y":
exit()
fetch_files(connection, xls_links)
if __name__ == "__main__":
downloader()
You should see this:
+------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------+
| ID | Name | Type | Link |
|------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------|
| CAL | California | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx |
| CAN | Canada | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAN.xlsx |
| CAR | Carolinas | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx |
| CENT | Central | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx |
| FLA | Florida | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx |
| MEX | Mexico | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MEX.xlsx |
| MIDA | Mid-Atlantic | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MIDA.xlsx |
| MIDW | Midwest | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MIDW.xlsx |
| NA | Outside Lower 48 | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_NA.xlsx |
| NW | Northwest | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_NW.xlsx |
| SE | Southeast | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_SE.xlsx |
| SW | Southwest | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_SW.xlsx |
| US48 | United States Lower 48 | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_US48.xlsx |
| AEC | PowerSouth Energy Cooperative | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AEC.xlsx |
| AECI | Associated Electric Cooperative, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AECI.xlsx |
| AVA | Avista Corporation | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AVA.xlsx |
| AVRN | Avangrid Renewables, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AVRN.xlsx |
| AZPS | Arizona Public Service Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AZPS.xlsx |
| BANC | Balancing Authority of Northern California | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/BANC.xlsx |
| BPAT | Bonneville Power Administration | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/BPAT.xlsx |
| CHPD | Public Utility District No. 1 of Chelan County | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CHPD.xlsx |
| CISO | California Independent System Operator | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CISO.xlsx |
| CPLE | Duke Energy Progress East | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CPLE.xlsx |
| CPLW | Duke Energy Progress West | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CPLW.xlsx |
| DEAA | Arlington Valley, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DEAA.xlsx |
| DOPD | PUD No. 1 of Douglas County | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DOPD.xlsx |
| DUK | Duke Energy Carolinas | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DUK.xlsx |
| EEI | Electric Energy, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/EEI.xlsx |
| EPE | El Paso Electric Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/EPE.xlsx |
| ERCO | Electric Reliability Council of Texas, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/ERCO.xlsx |
| FMPP | Florida Municipal Power Pool | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FMPP.xlsx |
| FPC | Duke Energy Florida, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FPC.xlsx |
| FPL | Florida Power & Light Co. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FPL.xlsx |
| GCPD | Public Utility District No. 2 of Grant County, Washington | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GCPD.xlsx |
| GLHB | GridLiance | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GLHB.xlsx |
| GRID | Gridforce Energy Management, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRID.xlsx |
| GRIF | Griffith Energy, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRIF.xlsx |
| GRMA | Gila River Power, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRMA.xlsx |
| GVL | Gainesville Regional Utilities | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GVL.xlsx |
| GWA | NaturEner Power Watch, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GWA.xlsx |
| HGMA | New Harquahala Generating Company, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/HGMA.xlsx |
| HST | City of Homestead | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/HST.xlsx |
| IID | Imperial Irrigation District | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/IID.xlsx |
| IPCO | Idaho Power Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/IPCO.xlsx |
| ISNE | ISO New England | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/ISNE.xlsx |
| JEA | JEA | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/JEA.xlsx |
| LDWP | Los Angeles Department of Water and Power | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/LDWP.xlsx |
| LGEE | Louisville Gas and Electric Company and Kentucky Utilities Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/LGEE.xlsx |
| MISO | Midcontinent Independent System Operator, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/MISO.xlsx |
| NEVP | Nevada Power Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NEVP.xlsx |
| NSB | Utilities Commission of New Smyrna Beach | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NSB.xlsx |
| NWMT | NorthWestern Corporation | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NWMT.xlsx |
| NYIS | New York Independent System Operator | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NYIS.xlsx |
| OVEC | Ohio Valley Electric Corporation | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/OVEC.xlsx |
| PACE | PacifiCorp East | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PACE.xlsx |
| PACW | PacifiCorp West | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PACW.xlsx |
| PGE | Portland General Electric Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PGE.xlsx |
| PJM | PJM Interconnection, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PJM.xlsx |
| PNM | Public Service Company of New Mexico | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PNM.xlsx |
| PSCO | Public Service Company of Colorado | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PSCO.xlsx |
| PSEI | Puget Sound Energy, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PSEI.xlsx |
| SC | South Carolina Public Service Authority | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SC.xlsx |
| SCEG | Dominion Energy South Carolina, Inc. | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SCEG.xlsx |
| SCL | Seattle City Light | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SCL.xlsx |
| SEC | Seminole Electric Cooperative | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SEC.xlsx |
| SEPA | Southeastern Power Administration | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SEPA.xlsx |
| SOCO | Southern Company Services, Inc. - Trans | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SOCO.xlsx |
| SPA | Southwestern Power Administration | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SPA.xlsx |
| SRP | Salt River Project Agricultural Improvement and Power District | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SRP.xlsx |
| SWPP | Southwest Power Pool | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SWPP.xlsx |
| TAL | City of Tallahassee | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TAL.xlsx |
| TEC | Tampa Electric Company | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TEC.xlsx |
| TEPC | Tucson Electric Power | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TEPC.xlsx |
| TIDC | Turlock Irrigation District | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TIDC.xlsx |
| TPWR | City of Tacoma, Department of Public Utilities, Light Division | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TPWR.xlsx |
| TVA | Tennessee Valley Authority | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TVA.xlsx |
| WACM | Western Area Power Administration - Rocky Mountain Region | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WACM.xlsx |
| WALC | Western Area Power Administration - Desert Southwest Region | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WALC.xlsx |
| WAUW | Western Area Power Administration - Upper Great Plains West | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WAUW.xlsx |
| WWA | NaturEner Wind Watch, LLC | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WWA.xlsx |
| YAD | Alcoa Power Generating, Inc. - Yadkin Division | BA | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/YAD.xlsx |
+------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------+
And if you type y
after the prompt:
Would you like to download the files? (y/n)
You'll get this:
Downloading file 1 of 81: Region_CENT.xlsx
Waiting 5 seconds...
Downloading file 2 of 81: AEC.xlsx
Waiting 4 seconds...
Downloading file 3 of 81: AECI.xlsx
Waiting 5 seconds...
Downloading file 4 of 81: AVA.xlsx
Waiting 4 seconds...
Downloading file 5 of 81: AZPS.xlsx
Waiting 2 seconds...
Downloading file 6 of 81: BANC.xlsx
Waiting 2 seconds...
Downloading file 7 of 81: BPAT.xlsx
Waiting 1 seconds...
and more...
Answered By - baduker
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.