Reading 17
Web Scraping: What Is It?
Web scraping is a method of finding and harvesting large volumes of data from websites. The process is usually automated with a programming language: a script fetches pages and then parses out the information of interest.
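A minimal sketch of the fetch-and-parse idea, using only Python's standard library. The HTML string below is a hypothetical stand-in for a page a real scraper would download (for example with `urllib.request.urlopen()`); the markup and link paths are illustrative assumptions, not from any real site.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag encountered while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content standing in for a fetched response body.
sample_html = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # harvested data: ['/page1', '/page2']
```

Real projects typically reach for third-party libraries such as `requests` and BeautifulSoup (as in the Towards Data Science source below), but the harvest-then-parse structure is the same.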
First, A Warning
Web scraping must be done responsibly. Check the website's terms and conditions to ensure you are not breaking any rules (or laws). Also, scraping can overload a website with requests. Space your requests out enough to keep the site unburdened; doing so will also help prevent you from being flagged as a spammer.
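One simple way to space requests out is to sleep for a random interval between fetches. The sketch below shows the pattern; the URLs and the 2-5 second delay bounds are illustrative assumptions (the actual fetch is left as a comment so the example runs without network access).

```python
import random
import time

MIN_DELAY, MAX_DELAY = 2.0, 5.0  # seconds between requests (assumed bounds)

def polite_delay(min_s=MIN_DELAY, max_s=MAX_DELAY):
    """Return a random pause length within the configured bounds."""
    return random.uniform(min_s, max_s)

# Hypothetical list of pages to scrape.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # response = urllib.request.urlopen(url)  # the actual fetch would go here
    pause = polite_delay()
    print(f"would fetch {url}, then pause {pause:.1f}s")
    # time.sleep(pause)  # uncomment in a real scraper to actually wait
```

Randomizing the interval (rather than sleeping a fixed amount) also helps with the "change up your patterns" advice below.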
Techniques
- robots.txt files - can typically be found at rooturl/robots.txt. Lays out the rules of the road for a website.
- Slow your roll - treat the website nicely; use auto-throttling to mimic human behavior
- Change up your patterns - humans don't follow the same patterns all the time. Use random clicks and mouse movements.
- Change up your IP Address - VPNs, TOR, Free and Shared Proxies
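The first technique can be automated with the standard library's `urllib.robotparser`. For a self-contained demo the rules are parsed from an inline string with hypothetical paths; a real scraper would instead point `set_url()` at the site's rooturl/robots.txt and call `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download this
# from the site's root URL via parser.set_url(...) and parser.read().
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "mybot" is an assumed user-agent name for this sketch.
print(parser.can_fetch("mybot", "https://example.com/public/page"))   # True
print(parser.can_fetch("mybot", "https://example.com/private/page"))  # False
print(parser.crawl_delay("mybot"))                                    # 10
```

The `Crawl-delay` value pairs naturally with the throttling advice above: honor it as the minimum gap between requests when the site specifies one.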
Sources:
- https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
- https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460
- https://www.youtube.com/watch?v=Bg9r_yLk7VY