Web Scrapin’: Focus on Python

Workshop Outline

  1. Introduction: To scrape or not to scrape
    1. Sometimes it’s better never to have scraped at all
      • Try asking first!
      • File->Save Page As… is your friend.
      • Consider using a GUI scraping app instead (many require subscriptions, though): Import.io, Portia, Diffbot, Extracty. Good list here.
    2. Don’t pay a great deal too dear for what’s given freely
      • Does the site have an API? An RSS feed?
        • JSON and XML are much easier to parse than hand-coded or auto-generated HTML.
        • Python (as well as R and other languages) has many modules that are custom-built to scrape specific web sources.
      • Look for bulk data access options (like this), or even just a big “Download” button.
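When a site does offer JSON, parsing it takes only a few lines. A minimal sketch using just the standard library (the response text below is a made-up example, not a real API response):

```python
import json

# A hypothetical API response. Structured JSON like this is far easier
# to work with than the equivalent hand-coded or auto-generated HTML.
response_text = '{"articles": [{"title": "Scraping 101", "date": "2016-01-01"}]}'

data = json.loads(response_text)
titles = [article["title"] for article in data["articles"]]
print(titles)  # ['Scraping 101']
```

In real use, `response_text` would come from an HTTP request (e.g. via `urllib.request.urlopen`), but the parsing step stays this simple.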
  2. The scrape’s afoot: tips and tricks
    1. Best practices
      • Find the right HTML elements to target: get used to right-clicking to “inspect element” or using the “View Page Source” menu option
      • Consider scholarly open-data requirements: if you can’t publish your results because sharing the scraped data would violate copyright or privacy, that’s a lot of wasted effort.
      • Play nice: when looping, limit requests to a few per minute, or the site may throttle you (and/or block you entirely)
      • stackoverflow.com usually has an answer
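The “play nice” advice above can be sketched as a loop with a delay between requests. The URLs and the `fetch` function here are placeholders; a real scraper would fetch with `urllib.request.urlopen(url).read()`:

```python
import time

# Placeholder list of pages to scrape.
urls = ["https://example.com/page/%d" % i for i in range(1, 4)]

def fetch(url):
    # Stub standing in for a real HTTP request, e.g.:
    #   urllib.request.urlopen(url).read()
    return "<html>stub for %s</html>" % url

pages = []
for url in urls:
    pages.append(fetch(url))
    time.sleep(1)  # raise to 20+ seconds for "a few requests per minute"
```

The key point is the `sleep` inside the loop: pacing requests keeps you off the server’s blocklist.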
    2. Scrapable vs. unscrapable sites (and what you can do with the former)
  3. Installing Python
    1. Consider all-in-one packages like Anaconda, plus lightweight development environments like Jupyter Notebook
    2. Installing pip
  4. Not so scary up close: Python Basics
    1. Code-and-tell
      1. Twitter scraper
      2. Article search data and contents scraper
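As a taste of what the code-and-tell demos involve, here is a minimal article-scraper sketch using only the standard library’s `html.parser`. The HTML snippet and the `class="headline"` attribute are invented for illustration; a real page’s structure is what “inspect element” reveals:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects the text of <h2 class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

# Hypothetical article-listing HTML.
html = ('<h2 class="headline">First story</h2><p>...</p>'
        '<h2 class="headline">Second story</h2>')

parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)  # ['First story', 'Second story']
```

Libraries like Beautiful Soup make this kind of extraction much more concise, but the stdlib version shows what is happening under the hood.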