Issue
What is a good way to compare new data with old data that is updated every day, using the Django ORM? I have a scraper (just a Celery task) that fetches hackathons every day, and I want the newly fetched ones to be unioned with my master database, which holds the hackathons fetched up to yesterday. I don't want to destroy my master database and then re-upload everything I just fetched, since that seems wasteful.
Solution
This looks more like a data comparison problem, and there is no need to save the Celery task in the database. Saving a Celery task in the database is mainly useful when the task itself needs to be scheduled.
For data comparison you can use a hash (MD5, SHA1, etc.). This speeds up the comparison, because you compare one fixed-length value per record instead of every field.
- For existing records in the database, add one column that stores a hash of the entire record. Use the algorithm of your choice (MD5, SHA1, SHA224, SHA256, Snefru, etc.).
- When new data is received/processed by the Celery task, create a hash of each new record too.
- Compare the hash created in the 2nd step with the hashes of all existing records.
- If a match is found, the data already exists in master. If not, the record is new and can be inserted.
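The steps above can be sketched in plain Python as follows. This is only a minimal illustration, not a definitive implementation: the helper names `record_hash` and `find_new_records` are hypothetical, the records are assumed to be JSON-serializable dicts, and SHA-256 is just one of the possible algorithms.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Return a stable SHA-256 hex digest for one scraped record.

    Assumption: the record is a dict. Keys are sorted so that the same
    data always yields the same hash, regardless of key order.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_new_records(scraped, existing_hashes):
    """Return (hash, record) pairs for records whose hash is not stored yet.

    `existing_hashes` is any iterable of hashes already saved in master.
    Duplicates inside `scraped` itself are also skipped.
    """
    seen = set(existing_hashes)
    new_records = []
    for record in scraped:
        digest = record_hash(record)
        if digest not in seen:
            seen.add(digest)
            new_records.append((digest, record))
    return new_records
```

In a Django project, the existing hashes could be loaded with something like `Hackathon.objects.values_list("content_hash", flat=True)`, and the new rows saved with `bulk_create`. The `Hackathon` model and its `content_hash` field are assumed names here, not part of the original question.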
Answered By - Poonam Adhav