Issue
I'm quite new to Scrapy and want to scrape a website for my research project.
The website has a number of classified listings on each page, and each page also contains a honeypot listing that is not visible in the browser (the div with the HappySpinoffs class in the code below). Inspecting the DOM shows the listing is there, but hidden by CSS rules in a style block in the HTML (not inline). In the HTML there is no difference between the honeypot listing and the other listings on the page; the only difference is the CSS properties defined for its class higher up in the HTML document. When I scrape the pages with Scrapy, the XPath selectors pick up the honeypot listing and the bot gets blocked. The class names are dynamically generated, and the position of the honeypot listing changes on each page. In the CSS block below, only the honeypot listing's class is actually used; the others are decoys.
I'm currently getting the listings with the following XPath: '/div[contains(@class, "js_resultTile")]'
but it also catches the honeypot listing. I don't know how to grab all those CSS classes via XPath and check them against the listings I get, so that the honeypot listing isn't scraped. Given that there are roughly 500,000 listings and that they need to be updated weekly, the solution must be very fast.
The HTML:
<div class="js_listingResultsContainer">
<div class="b34_promotedTile js_resultTile js_pseudoLinkContainer js_rollover_container HappyReacting" data-listing-number="P108146928">...</div>
<div class="b34_promotedTile js_resultTile js_pseudoLinkContainer js_rollover_container HappyMorrow" data-listing-number="P108079642">...</div>
<div class="b34_promotedTile js_resultTile js_pseudoLinkContainer js_rollover_container HappyPumping" data-listing-number="P107587584">...</div>
<div class="b34_promotedTile js_resultTile js_pseudoLinkContainer js_rollover_container HappyBudgeted" data-listing-number="P108129532">...</div>
<div class="b34_promotedTile js_resultTile js_pseudoLinkContainer js_rollover_container HappyDormant" data-listing-number="P107692442">...</div>
<div class="HappyMistimed js_resultTile" data-listing-number="106933717">...</div>
<div class="HappySalivas js_resultTile" data-listing-number="108171874">...</div>
<div class="HappyInanity js_resultTile" data-listing-number="108168952">...</div>
<div class="HappyMiss js_resultTile" data-listing-number="108168914">...</div>
<div class="HappyRevolver js_resultTile" data-listing-number="108138404">...</div>
<div class="HappyMongrel js_groupedResultTile" data-listing-number="108165172">...</div>
<div class="HappyMexicans js_groupedResultTile" data-listing-number="108111893">...</div>
<div class="HappyScaling js_resultTile" data-listing-number="108131862">...</div>
<div class="HappyJacob js_resultTile" data-listing-number="108108694">...</div>
<div class="HappyWhelp js_resultTile" data-listing-number="108152564">...</div>
<div class="HappyCome js_resultTile" data-listing-number="108163034">...</div>
<div class="HappyBrawler js_resultTile" data-listing-number="108153616">...</div>
<div class="HappySpinoffs js_resultTile" data-listing-number="107969187">...</div>
<div class="HappyDrug js_resultTile" data-listing-number="108117622">...</div>
<div class="HappyBecalmed js_resultTile" data-listing-number="108146204">...</div>
<div class="HappyInfante js_resultTile" data-listing-number="108134673">...</div>
</div>
The CSS properties further up in the HTML of the page (not external CSS file):
<style type="text/css">
.HappySpinoffs
{
position: absolute;
left: -6541px;
}
.HappyDefying
{
position: absolute;
left: -9018px;
}
.HappyBenefit
{
position: absolute;
left: -6421px;
}
.HappyAssert
{
left: -7575px;
position: absolute;
}
.HappyForswore
{
position: absolute;
left: -7694px;
}
.HappySmiler
{
left: -5308px;
position: absolute;
}
</style>
Solution
If you want to exclude some divs by class:
'/div[contains(@class, "js_resultTile")][not(contains(@class, "js_pseudoLinkContainer"))]'
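When the classes to exclude are only known at run time, the same kind of predicate can be built from a list. The following is a minimal sketch, not part of the original answer: the helper name listing_xpath is hypothetical, and it uses the common concat/normalize-space idiom so that a class name is matched as a whole token rather than as a substring.

```python
def listing_xpath(excluded_classes):
    """Build an XPath selecting result tiles that carry none of the excluded classes."""
    def has_class(name):
        # Token-exact class test; a plain contains() would also match substrings
        return f'contains(concat(" ", normalize-space(@class), " "), " {name} ")'

    xpath = f'//div[{has_class("js_resultTile")}]'
    # One not(...) predicate per excluded class, in a stable order
    for name in sorted(excluded_classes):
        xpath += f'[not({has_class(name)})]'
    return xpath


print(listing_xpath(["HappySpinoffs"]))
```

The resulting string can be passed to response.xpath() like any other expression, so the filtering happens inside the XPath engine instead of in a Python loop.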
UPDATE: Since the class names are generated dynamically, you need to parse the honeypot class names out of the page's CSS first:
honeypots = response.xpath('//style[@some_selectors_here]/text()').re(r'\.(\S+)\s+\{')
Next you need to get the class for each div you have:
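for listing_div in response.xpath('//div[@class="js_listingResultsContainer"]/div'):
    # Check every class on the div: the generated class may come first or last
    div_classes = set(listing_div.xpath('./@class').get(default='').split())
    if div_classes.isdisjoint(honeypots):
        # process the listing here
        pass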
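The two steps above can also be sketched without Scrapy, which makes them easy to test in isolation. This is an illustrative sketch only: the helper names hidden_classes and is_honeypot are hypothetical, and it assumes the hiding rules always look like the ones in the question (a single class selector with position: absolute and a negative left offset).

```python
import re

# Matches ".ClassName { ...declarations... }" for simple single-class rules
RULE_RE = re.compile(r'\.([\w-]+)\s*\{([^}]*)\}')
# Matches a negative "left" offset such as "left: -6541px"
OFFSCREEN_RE = re.compile(r'\bleft\s*:\s*-\d')


def hidden_classes(css_text):
    """Return the class names whose rule moves the element off-screen."""
    hidden = set()
    for name, body in RULE_RE.findall(css_text):
        if 'absolute' in body and OFFSCREEN_RE.search(body):
            hidden.add(name)
    return hidden


def is_honeypot(class_attr, hidden):
    """True if any class in the element's class attribute is a hidden one."""
    return not hidden.isdisjoint(class_attr.split())


css = """
.HappySpinoffs
{
position: absolute;
left: -6541px;
}
.HappyAssert
{
left: -7575px;
position: absolute;
}
"""
hidden = hidden_classes(css)
print(is_honeypot("HappySpinoffs js_resultTile", hidden))  # True
print(is_honeypot("HappyDrug js_resultTile", hidden))      # False
```

In a spider, css_text would be the text of the page's style element, and class_attr the class attribute of each listing div. Because the hidden names are kept in a set, each listing costs only a few hash lookups, which should matter little next to the network time for 500,000 listings.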
Answered By - gangabass