Sometimes it’s better never to have scraped at all
Try asking first!
File->Save Page As… is your friend.
Consider using a GUI scraping app instead (many require subscriptions, though): Import.io, Portia, Diffbot, Extracty.
Don’t pay a great deal too dear for what’s given freely
Does the site have an API? An RSS feed?
JSON and XML are much easier to parse than hand-coded or auto-generated HTML.
Python (as well as R and other languages) has many modules that are custom-built to scrape specific web sources.
Look for bulk data access options, or even just a big “Download” button.
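To see why an API beats hand-parsing HTML, here is a minimal sketch using only the standard library. The response string below is invented for illustration; a real API would return something similar, and `json.loads` turns it straight into Python lists and dicts with no tag-hunting at all.

```python
import json

# A hypothetical API response -- structured JSON, not markup.
api_response = '{"articles": [{"title": "Scraping 101", "date": "2020-01-15"}]}'

data = json.loads(api_response)               # parse the whole payload in one call
titles = [a["title"] for a in data["articles"]]  # plain list/dict access from here on
print(titles)
```

Compare that one-liner parse with the effort of locating the same titles inside auto-generated HTML.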
The scrape’s afoot: tips and tricks
Best practices
Find the right HTML elements to target: get used to right-clicking an element and choosing “Inspect Element,” or using the “View Page Source” menu option.
Consider scholarly open-data requirements: if you can’t publish your results because sharing the scraped data would violate copyright or privacy, that’s a lot of wasted effort.
Play nice: when looping, limit requests to a few per minute, or the site may throttle you (and/or block you entirely).
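Once “inspect element” has shown you where the data lives, you can pull out just those elements. A minimal sketch using only the standard library’s `HTMLParser` (in practice you would likely reach for a library such as Beautiful Soup); the page snippet and the `entry-title` class name are hypothetical stand-ins for whatever you find in the real markup:

```python
from html.parser import HTMLParser

# Suppose inspecting the page showed each headline inside
# <h2 class="entry-title">...</h2>  (class name is illustrative).
PAGE = """
<html><body>
  <h2 class="entry-title">First post</h2>
  <h2 class="entry-title">Second post</h2>
  <h2 class="sidebar">Not a headline</h2>
</body></html>
"""

class TitleScraper(HTMLParser):
    """Collect text from <h2> tags whose class is 'entry-title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "entry-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)
```

The same targeting logic carries over to any parser: identify a tag-plus-attribute pattern in the inspector, then select on exactly that pattern in code.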
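The “play nice” rule can be enforced in code rather than by discipline. A minimal sketch, again standard-library only; the 15-second delay (about four requests per minute) and the user-agent string are illustrative choices, not requirements of any particular site:

```python
import time
import urllib.request

MIN_DELAY = 15.0  # seconds between requests: roughly four per minute

def seconds_to_wait(last_request, now, min_delay=MIN_DELAY):
    """How long to sleep before the next request is polite to send."""
    return max(0.0, min_delay - (now - last_request))

_last = 0.0  # timestamp of the previous request

def polite_get(url):
    """Fetch a URL, sleeping first so a scraping loop stays rate-limited."""
    global _last
    time.sleep(seconds_to_wait(_last, time.monotonic()))
    _last = time.monotonic()
    req = urllib.request.Request(url, headers={"User-Agent": "workshop-scraper"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling `polite_get` inside a loop then spaces the requests out automatically, instead of hammering the server as fast as the loop can run.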
Web Scrapin’: Focus on Python
Join us here:
https://sandbox.idre.ucla.edu/sandbox/web-scrapin-focus-on-python
Workshop Outline