Issue
I created a scraper, but I keep struggling with one part: getting the keywords associated with a movie/tv-show title.
I have a df with the following URLs:
keyword_link_list = ['https://www.imdb.com/title/tt7315526/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11723916/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt7844164/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt2034855/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11215178/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt10941266/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt13210836/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt0913137/keywords?ref_=tt_ql_sm']
df = pd.DataFrame({'keyword_link':keyword_link_list})
print(df)
Then, I'd like to loop through the column keyword_link, get all the keywords, and add them to a dictionary. I manage to get all the keywords, but I can't manage to add them to a dictionary. It seems like a simple problem, but I'm not seeing what I'm doing wrong (after hours of struggling). Many thanks in advance for your help!
# Import packages
import requests
import re
from bs4 import BeautifulSoup
import bs4 as bs
import pandas as pd
# Loop through column keyword_link and get the keywords for each link
keyword_dicts = []
for index, row in df.iterrows():
    keyword_link = row['keyword_link']
    print(keyword_link)
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    r = requests.get(keyword_link, headers=headers)
    html = r.text
    soup = bs.BeautifulSoup(html, 'html.parser')
    elements = soup.find_all('td', {'class': "soda sodavote"})
    for element in elements:
        for keyword in element.find_all('a'):
            keyword = keyword['href']
            keyword = re.sub(r'\/search/keyword\?keywords=', '', keyword)
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            print(keyword)

keyword_dict = {}
keyword_dict['keyword'] = keyword
keyword_dicts.append(keyword_dict)

print(keyword_dicts)
Update
After running the definition, I get the following error:
Solution
Note: Because the expected output is not entirely clear, this example only operates on your list of URLs. You can use the output to create a DataFrame, lists, ...
What happens?
Your dictionary is only defined after the loop, so none of the information from the iterations gets stored and your list ends up as just [{'keyword': ''}].
How to fix?
Append your dictionary while iterating over the keywords.
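For example, here is a minimal sketch of that fix, reusing the imports and the df from your question and keeping the same find_all / regex approach, only with the dict built and appended inside the inner loop:
keyword_dicts = []

for index, row in df.iterrows():
    keyword_link = row['keyword_link']
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    r = requests.get(keyword_link, headers=headers)
    soup = bs.BeautifulSoup(r.text, 'html.parser')

    for element in soup.find_all('td', {'class': "soda sodavote"}):
        for a in element.find_all('a'):
            keyword = re.sub(r'\/search/keyword\?keywords=', '', a['href'])
            keyword = re.sub(r'\?item=kw\d+', '', keyword)
            # build and append the dict here, while still iterating
            keyword_dicts.append({'keyword': keyword})

print(keyword_dicts)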
Alternative approach:
However, you do not need a DataFrame at all; a single line is enough to get your keywords:
keywords = [e.a.text for e in soup.select('[data-item-keyword]')]
In the following example I come up with some variations on what could be collected and how:
Collect just the keywords, with whitespace between the words:
[e.a.text for e in soup.select('[data-item-keyword]')]
Collect the same keywords separated by "-", as in the URL:
['-'.join(x.split()) for x in keywords]
Collect the keywords together with their vote counts, which may also be interesting:
[{'keyword':k,'votes':v} for k,v in zip(keywords,votes)]
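As a quick, self-contained illustration of these three variations, the sketch below runs them against a small hypothetical HTML snippet rather than a live IMDb page (the real markup may differ):
from bs4 import BeautifulSoup

# hypothetical, simplified markup just for demonstration
html = '''
<td class="soda sodavote" data-item-keyword="time travel" data-item-votes="12"><a href="#">time travel</a></td>
<td class="soda sodavote" data-item-keyword="magic" data-item-votes="7"><a href="#">magic</a></td>
'''
soup = BeautifulSoup(html, 'html.parser')

keywords = [e.a.text for e in soup.select('[data-item-keyword]')]
votes = [e['data-item-votes'] for e in soup.select('[data-item-votes]')]

print(keywords)                                   # ['time travel', 'magic']
print(['-'.join(x.split()) for x in keywords])    # ['time-travel', 'magic']
print([{'keyword': k, 'votes': v} for k, v in zip(keywords, votes)])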
Example
import requests, time
from bs4 import BeautifulSoup
import pandas as pd
keyword_link_list = ['https://www.imdb.com/title/tt7315526/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11723916/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt7844164/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt2034855/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt11215178/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt10941266/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt13210836/keywords?ref_=tt_ql_sm',
'https://www.imdb.com/title/tt0913137/keywords?ref_=tt_ql_sm'
]
def cook_soup(url):
    # do not harm the website, add some delay
    # time.sleep(2)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.5'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup

data = []

for url in keyword_link_list:
    soup = cook_soup(url)
    keywords = [e.a.text for e in soup.select('[data-item-keyword]')]
    votes = [e['data-item-votes'] for e in soup.select('[data-item-votes]')]
    data.append({
        'url': url,
        'keywords': keywords,
    })

print(data)
### pd.DataFrame(data)
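If you prefer one row per keyword instead of one row per URL, a possible follow-up (assuming your pandas version provides DataFrame.explode) could be:
# flatten the list column so each keyword gets its own row
df_keywords = pd.DataFrame(data).explode('keywords').rename(columns={'keywords': 'keyword'})
print(df_keywords)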
Answered By - HedgeHog