Issue
I am trying to scrape the Subdivision Information section of a listing page, using the URL and soup below. The HTML of that section for one house is copied underneath.
import requests
from bs4 import BeautifulSoup

house_url = 'https://www.har.com/homedetail/2701-main-st-1910-houston-tx-77002/15331551'
house_response = requests.get(url=house_url, headers=your_header)
house_soup = BeautifulSoup(house_response.text, 'html.parser').find('div', {'class':'pt-2 pb-2 mr-4 pr-md-5 ml-4 pl-md-5'})
Subdivision Section HTML
<div id="subDivisonInfo" class="lazy" data-contentname="subdivision-facts"><div class="mb-5 pb-5 border-bottom border-color--cement_light">
<h2 tabindex="0">Subdivision Facts</h2>
<a class="font_weight--bold font_size--large mr-4" href="/geomarketarea/100_midtown---houston">View Neighborhood Profile </a>
<div class="mb-5 mt-4 pb-3">
<a href="/geomarketarea/100_midtown---houston">
<div class="mb-3 border_radius--round image" style="height: 360px; width: 100%; background-size: cover; background-repeat: no-repeat; background-position: center center; background-image: url("https://api.mapbox.com/styles/v1/mapbox/streets-v11/static/path-1+0000ff-0.45+0000ff-0.45(u%7CstDfobeQnCbExAhAdCx%40xBXvDDlA%40lMF%7ECBvBEhJOVAnDo%40dDiClByCpCkEh%60%40sv%40zx%40vi%40rGhEx%40jAXF%5ElAB%60%40Bx%40Dx%40%3FfBPlBLxA%60%40fL%5EtMNhGFlBJdEDxAHlANdCHbBPpDDpA%3FhDFzE%40xB%40zADbBa%40sB%5Bk%40q%40gA%7DA%7DAw%40k%40yAcAqBwAyAgAiBuAyCyBoBuAmA%7B%40u%40i%40%7BBaB%7BC%7BAsBc%40kF%5D%7BE%3F%7BDHqCIuESk%40%3FCvCAv%40%3FtCJ%7EK%3FlA%7DLB%3F%7B%40yJB_K%40_E%3FwJ%40mX%40yA%40sIFgM%40%3FkDOcCm%40_IeBgG%3F%3FaAyDs%40_EWgBGsASuDQiNGoNKwJ%3F%3FQuH%7EClJjCfG)/auto/651x360?access_token=pk.eyJ1IjoiaGFyZGV2ZXJpY2siLCJhIjoiY2sxZ3FuNWJpMDFtbDNjbDJ0bnJnbnpkdyJ9.byj8yrbalnyCw4u9TNwYuA");">
<img class="img-fluid img-loader" src="https://content.harstatic.com/img/common/loading1.gif" style="display: none;">
</div>
<script type="text/javascript">
/*! domready (c) Dustin Diaz 2014 - License MIT */
;!function(e,t){"undefined"!=typeof module?module.exports=t():"function"==typeof define&&"object"==typeof define.amd?define(t):this.domready=t()}(0,function(){var e,t=[],o="object"==typeof document&&document,n=o&&o.documentElement.doScroll,d=o&&(n?/^loaded|^c/:/^loaded|^i|^c/).test(o.readyState);return!d&&o&&o.addEventListener("DOMContentLoaded",e=function(){for(o.removeEventListener("DOMContentLoaded",e),d=1;e=t.shift();)e()}),function(e){d?setTimeout(e,0):t.push(e)}});
</script>
<script type="text/javascript">
domready(function() {
HARMap.load().then(function(module) {
var componentId = 'image24906579';
var polygon = 'POLYGON((-95.372842651 29.762188072,-95.373816894 29.761474216,-95.374191992 29.76101599,-95.374483311 29.760352652,-95.374609142 29.759738202,-95.374640089 29.758819359,-95.374647572 29.758426112,-95.374691499 29.756117694,-95.374706659 29.75532102,-95.3746799 29.754718062,-95.374599516 29.752906601,-95.374594347 29.752790096,-95.374351409 29.751909331,-95.373661419 29.751078253,-95.372887724 29.750528187,-95.371869687 29.74980439,-95.362970840876 29.744465093909,-95.369806756 29.735213416,-95.370819903 29.733833779,-95.371197558 29.733537028,-95.371239769 29.733411918,-95.371629671 29.733245349,-95.371804383 29.73323255,-95.372090663 29.733211576,-95.372379167 29.733175792,-95.372896911 29.733184661,-95.373448298 29.733085864,-95.373897555 29.73302357,-95.376020952 29.732848991,-95.378367141 29.732692501,-95.379698574 29.732605591,-95.380251649 29.73256989,-95.381236514 29.732506316,-95.381686127 29.732477294,-95.382077218 29.732432106,-95.382753475 29.732353969,-95.383254109 29.732299798,-95.384141891 29.732214015,-95.384547373 29.732184995,-95.385401694 29.732177061,-95.386504599 29.7321448,-95.387113815 29.732128476,-95.38757473 29.732116124,-95.388065951 29.732085741,-95.387487144 29.732263493,-95.387266095 29.732397447,-95.386907035 29.732649085,-95.386438833 29.733120089,-95.386221661 29.733399038,-95.385875385 29.733846224,-95.385438799 29.734415138,-95.38508092 29.734869849,-95.384651063 29.735397347,-95.384037746 29.736172341,-95.383612227 29.736729284,-95.383311768 29.737122539,-95.383099783 29.737389038,-95.382608073 29.738007194,-95.382145211 29.738793482,-95.381972784 29.739372769,-95.381824272 29.740550975,-95.381818116 29.741650627,-95.381867014 29.742586972,-95.381820616 29.743316543,-95.38172399 29.744387766,-95.381721351 29.744610716,-95.382483057 29.744631286,-95.382763059 29.744636527,-95.383509646 29.74463557,-95.385588363 29.744584135,-95.385979746 29.744575193,-95.386003821 29.746807942,-95.385699411 
29.746810253,-95.385718338 29.748696241,-95.385725089 29.750624573,-95.385732125 29.75158054,-95.385735183 29.753459289,-95.385748282 29.757531403,-95.385758824 29.757976236,-95.385799269 29.75968271,-95.385808634 29.76196431,-95.384952318 29.761964306,-95.384292125 29.762037035,-95.382689529 29.76226967,-95.381370964 29.762781324,-95.381373603 29.762782609,-95.380443808 29.763114063,-95.379476498 29.763374528,-95.378959294 29.763485938,-95.378540344 29.763528353,-95.377633441 29.763629227,-95.375180147 29.763724096,-95.37270141 29.763764494,-95.370824522 29.763815179,-95.370818039 29.76381558,-95.369274785 29.76391082,-95.371101581 29.763114639,-95.372416403 29.762414919,-95.372842651 29.762188072))';
var node = $('.' + componentId).removeClass(componentId);
// var result = module.StaticMap.custom.withPolygon(node.width(), node.height(), polygon)
// result.backgroundImage(node);
var result = module.StaticMap.custom.withPolygon(node.width(), node.height(), polygon)
result.backgroundImage(node);
/*var geometry = module.geometry;
var points = geometry.pointsFromWKT(polygon);
//console.log(points);
if(points.length > 100) { points = geometry.simplifyPolygon(points, 0.0001); }
if(points.length > 100) { points = geometry.simplifyPolygon(points, 0.001); }
//console.log(points);
var encString = geometry.encodePath(points);
var width = node.width();
var height = node.height();
if(!width) { console.error('width cannot be empty!'); }
if(!height) { console.error('height cannot be empty!'); }
var path = encodeURIComponent("weight:1|fillcolor:blue|enc:" + encString);
var url = "/api/staticmap?size="+ width +"x"+ height +"&path="+ path + "&client=gme-houstonrealtorsinformation";
// alert(url);
//$(node).html('<a class="pointer" href="'+url+'" id="hoodMapStaticLink"></a><img />');
var image = new Image();
image.onload = image.onerror = function() { node.find('img').remove(); }
image.src = url;
$(node).css('background-image', 'url(' + url + ')');*/
});
});
</script> </a>
</div>
<h3 class="mt-5 pb-3" tabindex="0">Facts (Based on Active listings)</h3>
<div class="row">
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Market Area Name</div>
<div class="font_size--large font_weight--regular">Midtown - Houston</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Home For Sales</div>
<div class="font_size--large font_weight--regular">104</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average List Price</div>
<div class="font_size--large font_weight--regular">$428,844</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Bedrooms</div>
<div class="font_size--large font_weight--regular">2.27</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Baths</div>
<div class="font_size--large font_weight--regular">2.07</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Sqft</div>
<div class="font_size--large font_weight--regular">1,873</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Price/Sqft</div>
<div class="font_size--large font_weight--regular">$236.48</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Home For Lease</div>
<div class="font_size--large font_weight--regular">96</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Lease</div>
<div class="font_size--large font_weight--regular">$2,396</div>
</div>
<div class="col-md-4 col-6 mb-4">
<div class="font_weight--bold font_size--small_extra">Average Lease/Sqft</div>
<div class="font_size--large font_weight--regular">$1.76</div>
</div>
</div>
</div>
</div>
However, when I use BeautifulSoup to get text such as "Average List Price: $428,844", this is the output I get:
house_soup.find('div',{'id':'subDivisonInfo'}).find('div',{'class':'row'}).findAll('div',{'class':'col-md-4 col-6 mb-4'})[0].getText()
'\n-----------\n-----------\n'
Why does it return this string instead of the actual text?
Solution
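The required data is loaded from an external source via AJAX, so the page HTML that requests receives contains only placeholders for it. You have to request the API URL that serves this section instead.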
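As a rough way to spot this situation, the sketch below checks whether a scraped string consists only of dashes, like the one returned in the question. The helper name looks_like_placeholder is hypothetical, and the assumption that placeholders are always rows of dashes comes only from the output shown in the question.

```python
def looks_like_placeholder(text: str) -> bool:
    """Return True if every non-empty line of `text` consists only of dashes."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # An empty string is not treated as a placeholder.
    return bool(lines) and all(set(line) == {"-"} for line in lines)


# The string returned by getText() in the question:
static_text = "\n-----------\n-----------\n"
print(looks_like_placeholder(static_text))  # True

# Text as it would look once the real values are present:
print(looks_like_placeholder("\nAverage List Price\n$428,844\n"))  # False
```

If the check returns True for the static page, the values are probably filled in later by a separate request, which is the case here.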
import requests
from bs4 import BeautifulSoup

# API URL for the subdivision facts; 15331551 is the same ID that ends the house URL.
api_url = 'https://www.har.com/api/getSubdivisionFacts/15331551'
html = requests.get(api_url).text

soup = BeautifulSoup(html, 'lxml')
# Find the label div, then take the value div that follows it.
label = soup.select_one('[class="col-md-4 col-6 mb-4"] > div:-soup-contains("Average List Price")')
price = label.find_next_sibling('div')
print(price.text)
Output:
$428,844
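If you need every fact rather than a single value, one possible approach is to walk each label/value pair and collect them into a dictionary. This is only a sketch: parse_facts is a hypothetical helper, and it runs on a short sample copied from the HTML in the question, on the assumption that the API response uses the same markup.

```python
from bs4 import BeautifulSoup

# Sample fragment copied from the HTML in the question (assumed to match
# what the API returns; the real response may differ).
sample_html = """
<div class="row">
  <div class="col-md-4 col-6 mb-4">
    <div class="font_weight--bold font_size--small_extra">Market Area Name</div>
    <div class="font_size--large font_weight--regular">Midtown - Houston</div>
  </div>
  <div class="col-md-4 col-6 mb-4">
    <div class="font_weight--bold font_size--small_extra">Average List Price</div>
    <div class="font_size--large font_weight--regular">$428,844</div>
  </div>
  <div class="col-md-4 col-6 mb-4">
    <div class="font_weight--bold font_size--small_extra">Average Bedrooms</div>
    <div class="font_size--large font_weight--regular">2.27</div>
  </div>
</div>
"""


def parse_facts(html: str) -> dict:
    """Map each fact label to its value, both as stripped strings."""
    soup = BeautifulSoup(html, "html.parser")
    facts = {}
    # Each grid cell holds exactly two child divs: the label and the value.
    for cell in soup.select("div.row > div.col-md-4"):
        parts = cell.find_all("div", recursive=False)
        if len(parts) == 2:
            label, value = parts
            facts[label.get_text(strip=True)] = value.get_text(strip=True)
    return facts


facts = parse_facts(sample_html)
print(facts["Average List Price"])  # $428,844
```

To use it on live data, pass the text of the API response to parse_facts instead of sample_html.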
Answered By - F.Hoque