Issue
My issue boils down to passing along "extra" information while iterating over a DataFrame. I pass each value in the DataFrame to a function to be checked, but I also need to pass and return some extra identifying information (the ID number) along with it, so that I can return it together with the result of that function.
Basic problem:
import pandas as pd

urls = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', '[email protected]', 'http://www.youtube.com', '888-555-5556 Ryan Parkes [email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.stackoverflow.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

for col in urls.columns:
    for url in urls[col]:
        if url:
            print(url, col)
            # I need to be able to print the corresponding ID that belongs to each URL
Desired output:
ID URL COL
1 apple link1
etc...
I think if that can be done with the for/in/if structure, then it can be applied to the real code below:
The real code is a little more complex. I am using asyncio.gather to process the dataframe. Passing the column name was simple, but I don't know how to get the ID.
import asyncio, aiohttp, time, pandas as pd
from validator_collection import checkers

url_df = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', '[email protected]', 'http://www.youtube.com', '888-555-5556 Ryan Parkes [email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.stackoverflow.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

async def get(url, sem, session, col):
    try:
        async with sem, session.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            print("Successfully got url {} from column {} with response of length {}.".format(url, col, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    sem = asyncio.BoundedSemaphore(50)
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, sem, session, col)
                                     for col in urls.columns  # for each column in the dataframe
                                     for url in urls[col]     # for each row in the column
                                     if url                   # if the item isn't null
                                     if checkers.is_url(url)])  # if the url is valid
    print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))

amount = url_df.count(axis='columns').sum()
start = time.time()
asyncio.run(main(url_df))
end = time.time()
print("Took {} seconds to pull {} websites.".format(end - start, amount))
Solution
Rather than iterating column by column, it probably makes more sense here to iterate over the rows, so the ID travels with the values from its own row.
There are many ways you could go about this:
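One direct way, keeping the question's column-wise loop intact, is to zip each link column against the ID column so every cell is paired with the ID from its own row. A minimal self-contained sketch on a trimmed-down version of the frame (ID still a regular column, not the index):

```python
import pandas as pd

urls = pd.DataFrame({
    'ID': [1, 2, 5],
    'link1': ['apple', 'www.google.com', '[email protected]'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', ''],
})

# Walk the link columns and pair every cell with the ID from the same row.
for col in urls.columns.drop('ID'):
    for ID, url in zip(urls['ID'], urls[col]):
        if url:
            print(ID, url, col)
```

This matches the desired `ID URL COL` output without restructuring the frame at all.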
For starters, it might make sense to index your dataframe by the ID column:
url_df = url_df.set_index('ID')
Then, among other possibilities, you could use the itertuples()
method:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link in row[1:]:
        print(ID, link)
Output:
1 apple
1 http://www.bing.com
1 http://www.stackoverflow.com~|~http://www.ebay.com
2 www.google.com
2 http://www.linkedin.com
2 http://www.imdb.com
5 [email protected]
5
5 http://www.google.co.uk
25 http://www.youtube.com
25 please call now
25 more random text that could be really long and annoying
26 888-555-5556 Ryan Parkes [email protected]
26 http://www.reddit.com
26 over the hills and through the woods
If you want to include the column names as well you could do it like:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link, col in zip(row[1:], row._fields[1:]):
        print(ID, link, col)
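Since `itertuples()` yields namedtuples, an equivalent (arguably more readable) variant uses the namedtuple `_asdict()` method to get column-name/value pairs directly. A small sketch, assuming `ID` has been set as the index as above:

```python
import pandas as pd

url_df = pd.DataFrame({
    'ID': [1, 2],
    'link1': ['apple', 'www.google.com'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com'],
}).set_index('ID')

for row in url_df.itertuples():
    fields = row._asdict()    # e.g. {'Index': 1, 'link1': 'apple', ...}
    ID = fields.pop('Index')  # the index always appears first, under the name 'Index'
    for col, link in fields.items():
        print(ID, link, col)
```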
For use in your actual code, it might be clearer if you wrapped this up in a subroutine like:
def iter_links(df):
    for row in df.itertuples():
        # The first item will always be the index, so:
        ID = row[0]  # or ID = row.Index
        # Then do whatever you want with the other columns:
        for url, col in zip(row[1:], row._fields[1:]):
            if url and checkers.is_url(url):
                yield (ID, col, url)
Then use this in your code (note that get() now receives the ID and column, so adjust its signature to match):

await asyncio.gather(*(get(session, sem, ID, col, url)
                       for ID, col, url in iter_links(df)))
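Tying it together: the sketch below is network-free (the real aiohttp request is replaced by a stub that just echoes its arguments) and swaps `checkers.is_url` for a small `urllib.parse`-based stand-in, so it runs without third-party validators. The point it demonstrates is that each gathered result carries its ID and column back out, so results can be tied to rows:

```python
import asyncio
from urllib.parse import urlparse

import pandas as pd

url_df = pd.DataFrame({
    'ID': [1, 2, 5],
    'link1': ['apple', 'http://www.youtube.com', 'http://www.google.co.uk'],
    'link2': ['http://www.bing.com', '', 'not a link'],
}).set_index('ID')

def is_url(text):
    # Stand-in for validator_collection's checkers.is_url().
    parts = urlparse(str(text))
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def iter_links(df):
    # Yield (ID, column name, url) for every valid URL in the frame.
    for row in df.itertuples():
        ID = row[0]
        for url, col in zip(row[1:], row._fields[1:]):
            if url and is_url(url):
                yield ID, col, url

async def get(sem, ID, col, url):
    # Network-free stub: a real version would fetch `url` with aiohttp here.
    # Returning the identifying info with the result is what lets each
    # gathered output be traced back to its row and column.
    async with sem:
        return ID, col, url, 'ok'

async def main(df):
    sem = asyncio.BoundedSemaphore(50)
    return await asyncio.gather(*(get(sem, ID, col, url)
                                  for ID, col, url in iter_links(df)))

results = asyncio.run(main(url_df))
for ID, col, url, status in results:
    print(ID, col, url, status)
```

asyncio.gather preserves the order of its arguments, so the results come back in the same row-then-column order that iter_links produced them.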
Answered By - Iguananaut