Issue
My issue boils down to passing along "extra" information while iterating over a DataFrame. I pass each value in the DataFrame to a function to be checked, but I also need to pass and return some extra identifying information (the ID number) along with it, so that I can return it together with the result of that function.
Basic problem:
import pandas as pd

urls = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', '[email protected]', 'http://www.youtube.com', '888-555-5556 Ryan Parkes [email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.stackoverflow.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

for col in urls.columns:
    for url in urls[col]:
        if url:
            print(url, col)
            # I need to be able to print the corresponding ID that belongs to each URL
Desired output:
ID URL COL
1 apple link1
etc...
I think if that can be done with the for/in/if structure, then it can be applied to the real code below:
The real code is a little more complex. I am using asyncio.gather to process the dataframe. Passing the column name was simple, but I don't know how to get the ID.
import asyncio, aiohttp, time, pandas as pd
from validator_collection import checkers

url_df = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', '[email protected]', 'http://www.youtube.com', '888-555-5556 Ryan Parkes [email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.stackoverflow.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

async def get(url, sem, session, col):
    try:
        async with sem, session.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            print("Successfully got url {} from column {} with response of length {}.".format(url, col, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    sem = asyncio.BoundedSemaphore(50)
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, sem, session, col)
                                     for col in urls.columns  # for each column in the dataframe
                                     for url in urls[col]     # for each row in the column
                                     if url                   # if the item isn't null
                                     if checkers.is_url(url)])  # if the url is valid
    print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))

amount = url_df.count(axis='columns').sum()
start = time.time()
asyncio.run(main(url_df))
end = time.time()
print("Took {} seconds to pull {} websites.".format(end - start, amount))
Solution
Rather than iterating column by column, it probably makes more sense here to iterate over the rows, so the ID travels with the values from its own row.
There are many ways you could go about this:
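One direct way, keeping the question's column-wise loop intact, is to zip each link column against the ID column so every cell is paired with the ID from its own row. A minimal self-contained sketch on a trimmed-down version of the frame (ID still a regular column, not the index):

```python
import pandas as pd

urls = pd.DataFrame({
    'ID': [1, 2, 5],
    'link1': ['apple', 'www.google.com', '[email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', ''],
})

# Walk the link columns and pair every cell with the ID from the same row.
for col in urls.columns.drop('ID'):
    for ID, url in zip(urls['ID'], urls[col]):
        if url:
            print(ID, url, col)
```

This matches the desired `ID URL COL` output without restructuring the frame at all.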
For starters, it might make sense to index your dataframe by the ID column:
url_df = url_df.set_index('ID')
Then, among other possibilities, you could use the itertuples()
method:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link in row[1:]:
        print(ID, link)
Output:
1 apple
1 http://www.bing.com
1 http://www.stackoverflow.com~|~http://www.ebay.com
2 www.google.com
2 http://www.linkedin.com
2 http://www.imdb.com
5 [email protected]
5
5 http://www.google.co.uk
25 http://www.youtube.com
25 please call now
25 more random text that could be really long and annoying
26 888-555-5556 Ryan Parkes [email protected]
26 http://www.reddit.com
26 over the hills and through the woods
If you want to include the column names as well you could do it like:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link, col in zip(row[1:], row._fields[1:]):
        print(ID, link, col)
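Since `itertuples()` yields namedtuples, an equivalent (arguably more readable) variant uses the namedtuple `_asdict()` method to get column-name/value pairs directly. A small sketch, assuming `ID` has been set as the index as above:

```python
import pandas as pd

url_df = pd.DataFrame({
    'ID': [1, 2],
    'link1': ['apple', 'www.google.com'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com'],
}).set_index('ID')

for row in url_df.itertuples():
    fields = row._asdict()    # e.g. {'Index': 1, 'link1': 'apple', ...}
    ID = fields.pop('Index')  # the index always appears first, under the name 'Index'
    for col, link in fields.items():
        print(ID, link, col)
```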
For use in your actual code, it might be clearer if you wrapped this up in a subroutine like:
def iter_links(df):
    for row in df.itertuples():
        # The first item will always be the index, so:
        ID = row[0]  # or ID = row.Index
        # Then do whatever you want with the other columns:
        for url, col in zip(row[1:], row._fields[1:]):
            if url and checkers.is_url(url):
                yield (ID, col, url)
Then use this in your code (note that get() now receives the ID and column, so adjust its signature to match):

await asyncio.gather(*(get(session, sem, ID, col, url)
                       for ID, col, url in iter_links(df)))
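Tying it together: the sketch below is network-free (the real aiohttp request is replaced by a stub that just echoes its arguments) and swaps `checkers.is_url` for a small `urllib.parse`-based stand-in, so it runs without third-party validators. The point it demonstrates is that each gathered result carries its ID and column back out, so results can be tied to rows:

```python
import asyncio
from urllib.parse import urlparse

import pandas as pd

url_df = pd.DataFrame({
    'ID': [1, 2, 5],
    'link1': ['apple', 'http://www.youtube.com', 'http://www.google.co.uk'],
    'link2': ['http://www.bing.com', '', 'not a link'],
}).set_index('ID')

def is_url(text):
    # Stand-in for validator_collection's checkers.is_url().
    parts = urlparse(str(text))
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def iter_links(df):
    # Yield (ID, column name, url) for every valid URL in the frame.
    for row in df.itertuples():
        ID = row[0]
        for url, col in zip(row[1:], row._fields[1:]):
            if url and is_url(url):
                yield ID, col, url

async def get(sem, ID, col, url):
    # Network-free stub: a real version would fetch `url` with aiohttp here.
    # Returning the identifying info with the result is what lets each
    # gathered output be traced back to its row and column.
    async with sem:
        return ID, col, url, 'ok'

async def main(df):
    sem = asyncio.BoundedSemaphore(50)
    return await asyncio.gather(*(get(sem, ID, col, url)
                                  for ID, col, url in iter_links(df)))

results = asyncio.run(main(url_df))
for ID, col, url, status in results:
    print(ID, col, url, status)
```

asyncio.gather preserves the order of its arguments, so the results come back in the same row-then-column order that iter_links produced them.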
Answered By - Iguananaut