Web Scrapin’: Focus on Python

Workshop Outline

  1. Introduction: To scrape or not to scrape
    1. Sometimes it’s better never to have scraped at all
      • Try asking first!
      • File->Save Page As… is your friend.
      • Consider using a GUI scraping app instead (many require subscriptions, though): Import.io, Portia, Diffbot, Extracty. Good list here.
    2. Don’t pay a great deal too dear for what’s given freely
      • Does the site have an API? An RSS feed?
        • JSON and XML are much easier to parse than hand-coded or auto-generated HTML.
        • Python (as well as R and other languages) has many modules that are custom-built to scrape specific web sources.
      • Look for bulk data access options (like this), or even just a big “Download” button.
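When a site does offer JSON, parsing it takes only a few lines. A minimal sketch using just the standard library (the response text below is a made-up example, not a real API response):

```python
import json

# A hypothetical API response. Structured JSON like this is far easier
# to work with than the equivalent hand-coded or auto-generated HTML.
response_text = '{"articles": [{"title": "Scraping 101", "date": "2016-01-01"}]}'

data = json.loads(response_text)
titles = [article["title"] for article in data["articles"]]
print(titles)  # ['Scraping 101']
```

In real use, `response_text` would come from an HTTP request (e.g. via `urllib.request.urlopen`), but the parsing step stays this simple.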
  2. The scrape’s afoot: tips and tricks
    1. Best practices
      • Find the right HTML elements to target: get used to right-clicking to “inspect element” or using the “View Page Source” menu option
      • Consider scholarly open-data requirements: if you can’t publish your results because sharing the scraped data would violate copyright or privacy, that’s a lot of wasted effort.
      • Play nice: when looping, limit requests to a few per minute, or the site may throttle you (and/or block you entirely)
      • stackoverflow.com usually has an answer
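The “play nice” advice above can be sketched as a loop with a delay between requests. The URLs and the `fetch` function here are placeholders; a real scraper would fetch with `urllib.request.urlopen(url).read()`:

```python
import time

# Placeholder list of pages to scrape.
urls = ["https://example.com/page/%d" % i for i in range(1, 4)]

def fetch(url):
    # Stub standing in for a real HTTP request, e.g.:
    #   urllib.request.urlopen(url).read()
    return "<html>stub for %s</html>" % url

pages = []
for url in urls:
    pages.append(fetch(url))
    time.sleep(1)  # raise to 20+ seconds for "a few requests per minute"
```

The key point is the `sleep` inside the loop: pacing requests keeps you off the server’s blocklist.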
    2. Scrapable vs. unscrapable sites (and what you can do with the former)
  3. Installing Python
    1. Consider all-in-one packages like Anaconda, plus lightweight development environments like Jupyter Notebook
    2. Installing pip
  4. Not so scary up close: Python Basics
    1. Code-and-tell
      1. Twitter scraper
      2. Article search data and contents scraper
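As a taste of what the code-and-tell demos involve, here is a minimal article-scraper sketch using only the standard library’s `html.parser`. The HTML snippet and the `class="headline"` attribute are invented for illustration; a real page’s structure is what “inspect element” reveals:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects the text of <h2 class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

# Hypothetical article-listing HTML.
html = ('<h2 class="headline">First story</h2><p>...</p>'
        '<h2 class="headline">Second story</h2>')

parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)  # ['First story', 'Second story']
```

Libraries like Beautiful Soup make this kind of extraction much more concise, but the stdlib version shows what is happening under the hood.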