Issue
I'm using partial to bind two parameters that are not iterables, so that I don't have to pass them through map(). I'm also using ThreadPoolExecutor, since the task I have here is I/O-bound.
The problem is that inside get_the_text_par() I have a for loop that should go through all the rows and send a request for each row (link), but it only processes the first row and skips the others. How can I fix this, or what am I missing here?
get_the_text_par = partial(get_the_text, _link_column=link, _firms=firms)
with ThreadPoolExecutor() as executor:
    # chunk_size = len(results) // 10
    chunk_size = len(results) if len(results) < 10 else len(results) // 10
    chunks = [results.iloc[i:i + chunk_size] for i in range(0, len(results), chunk_size)]
    result = list(executor.map(get_the_text_par, chunks))
get_the_text implementation:
def get_the_text(_df, _firms: list, _link_column: str):
    '''
    Send a request to receive the text of the articles.

    Parameters
    ----------
    _df : DataFrame

    Returns
    -------
    DataFrame with the text of the articles
    '''
    _df.reset_index(inplace=True)
    print(_df)
    for k, link in enumerate(_df[[f'{_link_column}']]):
        print(k, '\n', _df.loc[k, f'{_link_column}'])
        if link:
            website_text = list()
            # print(link, '\n', 'K:', k)
            try:
                page_status_code, page_content, page_url = send_two_requests(_df.loc[k, f'{_link_column}'])
......
.....
...
..
.
To reproduce the data:
data = {
'index': [1366, 4767, 6140, 11898],
'DATE': ['2014-01-12', '2014-01-12', '2014-01-12', '2014-01-12'],
'SOURCES': ['go.com', 'bloomberg.com', 'latimes.com', 'usatoday.com'],
'SOURCEURLS': [
'http://abcnews.go.com/Business/wireStory/mercedes-recalls-372k-suvs-21445846',
'http://www.bloomberg.com/news/2014-01-12/vw-patent-application-shows-in-car-gas-heater.html',
'http://www.latimes.com/business/autos/la-fi-hy-autos-recall-mercedes-20140112-story.html',
'http://www.usatoday.com/story/money/cars/2014/01/12/mercedes-recall/4437279/'
],
'Tone': [-0.375235, -1.842752, 1.551724, 2.521008],
'Positive_Score': [2.626642, 1.228501, 3.275862, 3.361345],
'Negative_Score': [3.001876, 3.071253, 1.724138, 0.840336],
'Polarity': [5.628518, 4.299754, 5.0, 4.201681],
'Activity_Reference_Density': [22.326454, 18.918919, 22.931034, 19.327731],
'Self_Group_Reference_Density': [0.0, 0.0, 0.344828, 0.840336],
'Year': [2014, 2014, 2014, 2014],
'Month': [1, 1, 1, 1],
'Day': [12, 12, 12, 12],
'Hour': [0, 0, 0, 0],
'Minute': [0, 0, 0, 0],
'Second': [0, 0, 0, 0],
'Mentioned_firms': ['mercedes', 'vw', 'mercedes', 'mercedes'],
'text': ['', '', '', '']
}
# Creating a DataFrame
df = pd.DataFrame(data)
Solution
The problem comes from what the loop iterates over. _df[[f'{_link_column}']] (double brackets) is a one-column DataFrame, not a Series of links, and iterating over a DataFrame yields its column labels rather than its rows. enumerate therefore produces a single pair, k = 0 with link set to the column name, so only the first row is processed.
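To see the difference concretely, here is a small sketch comparing iteration over the one-column DataFrame with iteration over the Series. The example.com URLs and the variable names are made up for illustration; only the column name SOURCEURLS comes from the question's data.

```python
import pandas as pd

# Illustrative data; the URLs are placeholders, not the question's links.
df = pd.DataFrame({
    "SOURCEURLS": [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/c",
    ]
})
link_column = "SOURCEURLS"

# Double brackets select a DataFrame; iterating it yields column labels.
labels = list(df[[link_column]])

# Single brackets select a Series; iterating it yields one value per row.
links = list(df[link_column])

print(labels)
print(links)
```

With a single selected column, the first list has exactly one element (the column name), which is why the original loop body ran only once.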
One way to fix this is to iterate over the rows with itertuples:
def get_the_text(_df, _firms: list, _link_column: str):
    _df.reset_index(inplace=True)
    print(_df)
    for row in _df.itertuples(index=False):
        link = getattr(row, f'{_link_column}')
        print(link)
        if link:
            website_text = list()
            try:
                page_status_code, page_content, page_url = send_two_requests(link)
                # Your remaining code here...
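For reference, the following is a minimal end-to-end sketch of how the corrected loop might fit together with partial and ThreadPoolExecutor. It rests on several assumptions: fetch_text is a hypothetical stand-in for send_two_requests and performs no network I/O, the example.com URLs and the firm list are placeholders, and iterating the Series directly is used here as a simpler alternative to itertuples when only the link column is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import pandas as pd


def fetch_text(url):
    """Hypothetical stand-in for send_two_requests; no real request is sent."""
    return f"text for {url}"


def get_the_text(_df, _firms, _link_column):
    # reset_index(drop=True) returns a new frame, so the caller's chunk is untouched.
    out = _df.reset_index(drop=True)
    # Iterate the Series of links (single brackets), one value per row.
    texts = [fetch_text(link) if link else "" for link in out[_link_column]]
    return out.assign(text=texts)


# Placeholder input: 25 rows with made-up URLs.
results = pd.DataFrame({"SOURCEURLS": [f"http://example.com/{i}" for i in range(25)]})

# Bind the non-iterable arguments so map() only has to supply the chunk.
get_the_text_par = partial(get_the_text, _firms=["mercedes", "vw"], _link_column="SOURCEURLS")

# Same chunking rule as in the question.
chunk_size = len(results) if len(results) < 10 else len(results) // 10
chunks = [results.iloc[i:i + chunk_size] for i in range(0, len(results), chunk_size)]

with ThreadPoolExecutor() as executor:
    # executor.map returns results in the same order as the input chunks.
    processed = pd.concat(list(executor.map(get_the_text_par, chunks)), ignore_index=True)

print(len(processed))
```

Returning a new DataFrame from each call, rather than mutating the chunk in place, keeps the worker threads from writing to slices of the shared results frame.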
Answered By - Reda Bourial