Issue
In a Python Scrapy crawler I would like to add a robust mechanism for monitoring/detecting potential layout changes on a website.
These changes do not necessarily affect the existing spider selectors; for example, a site might add a new HTML element showing the number of visitors an item has received, an element I might now be interested in parsing. That said, detecting selector issues (XPath/CSS) would also be beneficial in cases where the targeted elements are removed or relocated.
Please note this is not about changes in selector content or a site refresh (if-modified-since or last-modified), but rather about a modification in the structure/nodes/layout of a site.
How would one implement logic to monitor for such structural changes?
Solution
This is actually a research topic, as you can see in this paper, but there are of course some implemented tools you can check out:
- https://github.com/matiskay/html-similarity
- https://github.com/matiskay/html-cluster
- https://github.com/TeamHG-Memex/page-compare
Basically, the basis for comparison in the approaches above is the tree edit distance between the HTML layouts.
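As a minimal sketch of that idea (not taken from any of the tools above, and assuming the `zss` Zhang-Shasha tree edit distance library and `lxml` are installed): label each DOM node with its tag name, convert two HTML snapshots into `zss` trees, and treat a distance above some threshold as a layout change worth flagging. The helper names and the example documents are illustrative only.

```python
# Sketch: structural comparison of two HTML snapshots via tree edit distance.
# Assumes `pip install zss lxml`; function names here are hypothetical.
import lxml.html
from zss import Node, simple_distance


def html_to_zss(element):
    """Recursively convert an lxml element into a zss Node tree,
    labelling each node with its tag name (ignoring text content)."""
    label = element.tag if isinstance(element.tag, str) else "comment"
    node = Node(label)
    for child in element:
        node.addkid(html_to_zss(child))
    return node


def layout_distance(old_html: str, new_html: str) -> int:
    """Tree edit distance between the DOM structures of two documents."""
    old_tree = html_to_zss(lxml.html.fromstring(old_html))
    new_tree = html_to_zss(lxml.html.fromstring(new_html))
    return simple_distance(old_tree, new_tree)


if __name__ == "__main__":
    old = "<html><body><div><h1>Item</h1><p>price</p></div></body></html>"
    new = ("<html><body><div><h1>Item</h1><p>price</p>"
           "<span>42 visitors</span></div></body></html>")
    # Distance is 1 here: a single <span> element was inserted.
    print(layout_distance(old, new))
```

In a Scrapy project, a comparison like this could run against a stored snapshot of each page type (for instance, from a spider middleware or an item pipeline) so that structural drift raises an alert without breaking the crawl.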
Answered By - eLRuLL