Sunday, December 31, 2023

[FIXED] Construct DataFrame from scraped data using Scrapy

Labels: python, scrapy, web-scraping

Issue

I am having trouble building a CSV file from scraped data. I have managed to scrape the data from the table, but I have been stuck for days on writing it out. I am using Scrapy items, collecting them in lists, and trying to load them into a pandas DataFrame.

import scrapy
from wiki.items import WikiItem
import pandas as pd

class Spider(scrapy.Spider):

    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):

        items = {'Name':[], 'Position':[], 'Office':[], 'Age':[],
            'Start_Date':[],'Salary':[]}

        trs = response.xpath('//table[@id="example"]//tr')
        name = WikiItem()
        pos = WikiItem()
        office = WikiItem()
        age = WikiItem()
        start_data = WikiItem()
        salary = WikiItem()

        name['name'] = trs.xpath('//td[1]//text()').extract()
        pos['position'] = trs.xpath('//td[2]//text()').extract()
        office['office'] = trs.xpath('//td[3]//text()').extract()
        age['age'] = trs.xpath('//td[4]//text()').extract()
        start_data['start_data'] = trs.xpath('//td[5]//text()').extract()
        salary['salary'] = trs.xpath('td[6]//text()').extract()

        items['Name'].append(name)
        items['Position'].append(pos)
        items['Office'].append(office)
        items['Age'].append(age)
        items['Start_Date'].append(start_data)
        items['Salary'].append(salary)

        x = pd.DataFrame(items, columns=['Name','Position','Office','Age',
            'Start_Date','Salary'])

        yield x.to_csv("r",sep=",")

This is what I get from that code:

,Name,Position,Office,Age,Start_Date,Salary
0,"{'name': [u'Tiger Nixon',
      u'Garrett Winters',
      u'Ashton Cox',
      u'Cedric Kelly',
      u'Airi Satou',
      u'Brielle Williamson',
      u'Herrod Chandler',

I am getting the Name column, but I get it 59 times. For instance, the first row, 'Tiger Nixon', appears 59 times. The Position column is also repeated 59 times, and so on. The scraped data is not well structured either. I am new to Scrapy and open to any help or suggestions. Thanks in advance!

EDIT: My items.py looks like this:

import scrapy


class WikiItem(scrapy.Item):

    name = scrapy.Field()
    position = scrapy.Field()
    office = scrapy.Field()
    age = scrapy.Field()
    start_data = scrapy.Field()
    salary = scrapy.Field()

Solution

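OK, I can't comment, and I can't test your code because I don't have the definition of WikiItem. But let's iterate over the rows of the response instead. An XPath that starts with // (such as //td[1]) searches the whole document even when it is called on trs, so each field ends up holding an entire column, and the DataFrame gets a single row. A path relative to each tr (td[1], with no leading //) only looks at that row's cells. Can you check what you get with this code?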

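import pandas as pd
import scrapy


class Spider(scrapy.Spider):

    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):

        trs = response.xpath('//table[@id="example"]//tr')

        if trs:
            items = []
            for tr in trs:
                # Debug output: the second cell of the current row only.
                # print() with one argument works on Python 2 and 3.
                print(tr.xpath('td[2]//text()').extract())
                # Relative paths (no leading '//') are evaluated per row.
                item = {
                    "Name": tr.xpath('td[1]//text()').extract(),
                    "Position": tr.xpath('td[2]//text()').extract(),
                    "Office": tr.xpath('td[3]//text()').extract(),
                    "Age": tr.xpath('td[4]//text()').extract(),
                    "Start_Date": tr.xpath('td[5]//text()').extract(),
                    "Salary": tr.xpath('td[6]//text()').extract()
                }
                items.append(item)

            # One dict per row gives one DataFrame row per table row.
            x = pd.DataFrame(items, columns=['Name','Position','Office','Age',
                'Start_Date','Salary'])

            # to_csv() returns None when given a path, so there is nothing
            # useful to yield; this writes a file literally named "r".
            x.to_csv("r",sep=",")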

Answered By - Esteban Martinena Guerrero

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
