Reading 17
Web Scraping: What Is It?
Web scraping is a method of finding and harvesting large volumes of data from websites. The process is usually automated with a programming language: a script fetches pages and then parses out the information of interest.
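A minimal sketch of the fetch-and-parse idea, using only Python's standard library. The HTML string below is a hypothetical stand-in for a page a real scraper would download (for example with `urllib.request.urlopen()`); the markup and link paths are illustrative assumptions, not from any real site.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag encountered while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content standing in for a fetched response body.
sample_html = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # harvested data: ['/page1', '/page2']
```

Real projects typically reach for third-party libraries such as `requests` and BeautifulSoup (as in the Towards Data Science source below), but the harvest-then-parse structure is the same.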
First, A Warning
Web scraping must be done responsibly. Check the website's terms and conditions to ensure you are not breaking any rules (or laws). Also, scraping can overload a website with requests. Space your requests out enough to keep the site unburdened; doing so will also help prevent you from being flagged as a spammer.
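One simple way to space requests out is to sleep for a random interval between fetches. The sketch below shows the pattern; the URLs and the 2-5 second delay bounds are illustrative assumptions (the actual fetch is left as a comment so the example runs without network access).

```python
import random
import time

MIN_DELAY, MAX_DELAY = 2.0, 5.0  # seconds between requests (assumed bounds)

def polite_delay(min_s=MIN_DELAY, max_s=MAX_DELAY):
    """Return a random pause length within the configured bounds."""
    return random.uniform(min_s, max_s)

# Hypothetical list of pages to scrape.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # response = urllib.request.urlopen(url)  # the actual fetch would go here
    pause = polite_delay()
    print(f"would fetch {url}, then pause {pause:.1f}s")
    # time.sleep(pause)  # uncomment in a real scraper to actually wait
```

Randomizing the interval (rather than sleeping a fixed amount) also helps with the "change up your patterns" advice below.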
Techniques
- robots.txt files - can typically be found at rooturl/robots.txt. Lays out the rules of the road for a website.
- Slow your roll - treat the website nicely; use auto-throttling to mimic human behavior
- Change up your patterns - humans don't follow the same patterns all the time. Use random clicks and mouse movements.
- Change up your IP Address - VPNs, TOR, Free and Shared Proxies
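The first technique can be automated with the standard library's `urllib.robotparser`. For a self-contained demo the rules are parsed from an inline string with hypothetical paths; a real scraper would instead point `set_url()` at the site's rooturl/robots.txt and call `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download this
# from the site's root URL via parser.set_url(...) and parser.read().
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "mybot" is an assumed user-agent name for this sketch.
print(parser.can_fetch("mybot", "https://example.com/public/page"))   # True
print(parser.can_fetch("mybot", "https://example.com/private/page"))  # False
print(parser.crawl_delay("mybot"))                                    # 10
```

The `Crawl-delay` value pairs naturally with the throttling advice above: honor it as the minimum gap between requests when the site specifies one.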
Sources:
- https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
- https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460
- https://www.youtube.com/watch?v=Bg9r_yLk7VY