Issue
I am trying to get data from a table on a website using Scrapy. The first column in the table has a few cells with a "ROWSPAN" of various lengths. When I run my crawl code, the table output gets messed up because the code does not recognize the data in column 1 for all the rows covered by the original "ROWSPAN" attribute.
I have created an HTML file that mimics the table I'm trying to crawl and shows what I am trying to accomplish. As you can see in the HTML output, the "Family" column spans all rows containing members of the same family. But the Scrapy output only shows the correct Family title on the first entry for each family.
Scrapy Code:
import scrapy


class FamilyTableSpider(scrapy.Spider):
    name = 'familytable'
    allowed_domains = ['127.0.0.1']  # domain changed to protect the innocent
    start_urls = ['127.0.0.1']  # url changed to protect the innocent

    def start_requests(self):
        urls = [
            '127.0.0.1',  # url changed to protect the innocent
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-striped table-bordered"]//tbody/tr'):
            yield {
                'Family' : row.xpath('td[1]//text()').extract_first(),
                'Name': row.xpath('td[2]//text()').extract_first(),
                'Relationship' : row.xpath('td[3]//text()').extract(),
                'Age' : row.xpath('td[3]//text()').extract_first(),
            }


"""run code with: scrapy crawl familytable -O familytable.json"""
html code:
<html>
<body>
<table width="50%" border="0" cellspacing="10" cellpadding="0" class="table table-striped table-bordered">
<thead>
<tr>
<th align="center" valign="top" width="25%">Family</th><th align="left" valign="top" width="25%">Name</th>
<th align="left" valign="top" width="25%">Relationship</th>
<th align="left" valign="top" width="25%">Age</th>
</tr>
</thead>
<tbody>
<tr class="linebottom">
<td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="5">Smith</td>
<td align="left" valign="top">Thomas</a></td>
<td align="left" valign="top">Father<br>Husband<br></td><td align="left" valign="top">58</td>
</tr>
<tr class="linebottom"><td align="left" valign="top">Mary</a></td>
<td align="left" valign="top">Mother<br>Wife<br></td><td align="left" valign="top">57</td>
</tr>
<tr class="linebottom">
<td align="left" valign="top">Joe</a></td>
<td align="left" valign="top">Son<br></td><td align="left" valign="top">18</td>
</tr>
<tr class="linebottom">
<td align="left" valign="top">Sue</a></td>
<td align="left" valign="top">Daughter<br></td><td align="left" valign="top">16</td>
</tr>
<tr class="linebottom">
<td align="left" valign="top">Tommy</a></td>
<td align="left" valign="top">Son<br></td><td align="left" valign="top">13</td>
</tr>
<tr class="linebottom">
<td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="4">Jones</td>
<td align="left" valign="top">James</a></td>
<td align="left" valign="top">Father<br>Husband<br></td><td align="left" valign="top">42</td>
</tr>
<tr class="linebottom"><td align="left" valign="top">Linda</a></td>
<td align="left" valign="top">Mother<br>Wife<br></td><td align="left" valign="top">42</td>
</tr>
<tr class="linebottom"><td align="left" valign="top">Anthony</a></td>
<td align="left" valign="top">Son</td><td align="left" valign="top">14</td>
</tr>
<tr class="linebottom"><td align="left" valign="top">Jeff</a></td>
<td align="left" valign="top">Son</td><td align="left" valign="top">11</td>
</tr>
<tr class="linebottom">
<td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="2">Johnson</td>
<td align="left" valign="top">Stephen</a></td>
<td align="left" valign="top">Husband</td><td align="left" valign="top">29</td>
</tr>
<tr class="linebottom">
<td align="left" valign="top">Samantha</a></td>
<td align="left" valign="top">Wife</td><td align="left" valign="top">28</td>
</tr>
</tbody>
</table>
</body>
</html>
Solution
You can either use XPath selectors to select every tr from the one whose first cell is <td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="..."> up to the next such tr,
including that first tr itself, and scrape whatever you want from those rows.
Another solution is to use the rowspan
attribute. It tells you how many rows there are for each family (see the example).
import scrapy


class FamilyTableSpider(scrapy.Spider):
    name = 'familytable'
    # start_urls is not shown here; set it to the page that contains the table.

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        counter = 0  # index of the first row of the current family
        rowspans = []
        # The Family cells are the only centered cells in the tbody; their
        # rowspan is the number of rows that belong to that family.
        # A cell without a rowspan attribute covers one row (the HTML default).
        for first_tag in response.xpath('//*[@class="table table-striped table-bordered"]//tbody/tr/td[@align="center"]'):
            rowspans.append(int(first_tag.xpath('.//@rowspan').get(default='1')))

        rows = response.xpath('//*[@class="table table-striped table-bordered"]//tbody/tr')
        for rowspan in rowspans:
            family = rows[counter].xpath('./td[1]//text()').get(default='')
            for i in range(rowspan):
                index = counter + i
                # Count cells from the right, because rows without a
                # Family cell have one cell fewer.
                family_member = {
                    'Family': family,
                    'Name': rows[index].xpath('./td[last()-2]//text()').get(),
                    'Relationship': ', '.join(rows[index].xpath('./td[last()-1]//text()').getall()),
                    'Age': rows[index].xpath('./td[last()]//text()').get(),
                }
                yield family_member
            counter += rowspan
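The rowspan bookkeeping in the example above can also be kept separate from the XPath code. The following is only a sketch under stated assumptions: expand_rowspans is a hypothetical helper (not part of Scrapy), each row is given as a list of (text, rowspan) tuples in the order the td cells appear, and only rowspan is handled, not colspan. It works for a spanning cell in any column, not just the first.

```python
def expand_rowspans(rows):
    """Return rows in which every spanned cell is repeated on each row it covers.

    `rows` is a list of rows; each row is a list of (text, rowspan) tuples.
    """
    pending = {}  # column index -> (text, number of further rows to fill)
    result = []
    for row in rows:
        cells = iter(row)
        full_row = []
        col = 0
        while True:
            if col in pending:
                # This column is still covered by a cell from an earlier row.
                text, remaining = pending[col]
                full_row.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (text, remaining - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells in this row
                text, rowspan = cell
                full_row.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            col += 1
        result.append(full_row)
    return result


# Hypothetical input mirroring the first rows of the table in the question.
table = [
    [('Smith', 2), ('Thomas', 1), ('Father, Husband', 1), ('58', 1)],
    [('Mary', 1), ('Mother, Wife', 1), ('57', 1)],
    [('Johnson', 1), ('Stephen', 1), ('Husband', 1), ('29', 1)],
]
expanded = expand_rowspans(table)
for row in expanded:
    print(row)
```

In a spider, each input row could be built from `tr.xpath('./td')`, joining the cell's `.//text()` results for the text and reading the span with `td.xpath('@rowspan').get(default='1')`.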
Answered By - SuperUser