Start a new python script in either your preferred text editor or Python IDE.
To work with the web, our python script needs to send out data, one way to do this is using the built-in “request” module. Bring it in using the following:
import requests
The only extra library we will be using is lxml, which is a module used to parse and search through xml.
You can either import the whole module with “import lxml” or you can take the classes we are interested in which is the html and etree. For later, we will also need defaultdict from collections too. So the code should look like this:
from lxml import html,etree
from collections import defaultdict
Now, let’s use the request by setting a variable and getting a page:
We could avoid page =but we want to use the page we get, so we have to set it to a variable; otherwise it would just be sitting around doing nothing.
Let’s use the module we imported and print the result:
print page.content
What does your script return?
In looking at our url http://www.google.com/search?tbm=isch&q=ucla , we can see that q=ucla is our search term, but we are going to make it a variable called search term and add it to the url, like so:
import requests,os,csv
from lxml import html,etree
from collections import defaultdict
searchTerm = 'ucla'
page = requests.get('http://www.google.com/search?tbm=isch&q='+searchTerm)
print page.content
Part II – Let’s us Nodes what is going on
Next we want to use the tree and html modules, because that allows us to search nodes:
tree = html.fromstring(page.content)
html.fromstring allows us to search the page as a tree, rather than simply text, but how do we search?
We have to use the “xpath” module from the tree.
Let’s use an example by grabbing all the links from the page:
for node in tree.xpath('//a/@href'):
print node
What this does is for all the nodes in tree.xpath , it looks for all the ‘a’ anchors that have an ‘href’ attribute and prints the node. With xpath, you can traverse down a tree and grab the children with each ‘/’ character.
[expand title=”How would you store the result into a variable?”]
links = tree.xpath('//a/@href')
[/expand]
We are only going to grab the images, so let’s use the img tag.
images = tree.xpath('//img')
This should work, but it will return the actual images.. That’s no good! We actually want the image sources, so we just need to add another node to get that attribute called @src
images = tree.xpath('//img/@src')
Feel free to print it out if you’d like.
Your code should look like this:
import requests,os,csv
from lxml import html,etree
from collections import defaultdict
searchTerm = 'ucla'
page = requests.get('http://www.google.com/search?tbm=isch&q='+searchTerm)
tree = html.fromstring(page.content)
images = tree.xpath('//img/@src')
print images
Step 3 – Let’s close the loop
Remember that a loop in Python has the following components:
for something in someArray:
#do something<br>
for [simple_tooltip content=’This is an alias for indivudal items’]something[/simple_tooltip] in [simple_tooltip content=’this is a list of items’]someArray[/simple_tooltip]:
[simple_tooltip content=’Here is where you have code to do an action with each item’]#do something[/simple_tooltip]<br>
For example:
someArray = [0,1,2,3,4,5]
for number in someArray:
print number
In order to do a loop, something needs be looped over, in this case it is some array of numbers.
Let’s add this code to our Google scraper:
#set an array for the image contents
theData = []
#loop for the content
for source in images:
theData.append(source)
print source
In Line 2, we are creating an empty to store our image source contents, and calling it theData . In line 5, we begin our for-loop. Then we add our content to theData array using theData.append(source) . We can print each clean source if we wanted using print source.
Step 4 Put it into a CSV
Now that we have everything cleaned up, we are going to write it to a csv file:
Line 1 gives us the target file, which is going to be our very own “searchTerm”.csv. Line 2 targets our file with csv.writer() , and assigns it to “writer”.
We can check our images with the following:
print 'number of images found:',len(images)
Finally, we will do our last for-loop and close the csv file:
#write the data into the csv file
for row in theData:
writer.writerow([row])
print row
targetFile.close()
print "done with script"
Make sure that “row” is within brackets or the csv file will be weird!
Congratulations you have scraped a Google page!!
Extra step: Let’s add some user input
But wait, instead of searching for “ucla” all the time, we can have the user input a search term by doing the following:
#ask the user for the search term
searchterm = str(raw_input("What is the search term? \n"))
Your final code should look like the following:
import requests,os,csv
from lxml import html,etree
from collections import defaultdict
#ask the user for the search term
searchTerm = str(raw_input("What is the search term? \n"))
page = requests.get('http://www.google.com/search?tbm=isch&q='+searchTerm)
tree = html.fromstring(page.content)
images = tree.xpath('//img/@src')
print images
#set an array for the image contents
theData = []
#loop for the image source
for source in images:
cleanSource = source.encode('utf-8')
theData.append(source)
print source
#define the target csv
targetFile = open("./"+searchTerm+".csv", "wb")
writer = csv.writer(targetFile)
#write the data into the csv file
for row in theData:
writer.writerow([row])
print row
print 'number of images found:',len(images)
targetFile.close()
print "done with script"
Python and the Web
Part I – Setting up the script
Start a new python script in either your preferred text editor or Python IDE.
To work with the web, our python script needs to send out data, one way to do this is using the built-in “request” module. Bring it in using the following:
The only extra library we will be using is lxml, which is a module used to parse and search through xml.
You can either import the whole module with “import lxml” or you can take the classes we are interested in which is the html and etree. For later, we will also need defaultdict from collections too. So the code should look like this:
Now, let’s use the request by setting a variable and getting a page:
We could avoid page = but we want to use the page we get, so we have to set it to a variable; otherwise it would just be sitting around doing nothing.
Let’s use the module we imported and print the result:
What does your script return?
In looking at our url http://www.google.com/search?tbm=isch&q=ucla , we can see that q=ucla is our search term, but we are going to make it a variable called search term and add it to the url, like so:
Your code should look similar to this:
Part II – Let’s us Nodes what is going on
Next we want to use the tree and html modules, because that allows us to search nodes:
html.fromstring allows us to search the page as a tree, rather than simply text, but how do we search?
We have to use the “xpath” module from the tree.
Let’s use an example by grabbing all the links from the page:
What this does is for all the nodes in tree.xpath , it looks for all the ‘a’ anchors that have an ‘href’ attribute and prints the node. With xpath, you can traverse down a tree and grab the children with each ‘/’ character.
[expand title=”How would you store the result into a variable?”]
[/expand]
We are only going to grab the images, so let’s use the img tag.
This should work, but it will return the actual images.. That’s no good! We actually want the image sources, so we just need to add another node to get that attribute called @src
Feel free to print it out if you’d like.
Your code should look like this:
Step 3 – Let’s close the loop
Remember that a loop in Python has the following components:
for [simple_tooltip content=’This is an alias for indivudal items’]something[/simple_tooltip] in [simple_tooltip content=’this is a list of items’]someArray[/simple_tooltip]:
[simple_tooltip content=’Here is where you have code to do an action with each item’]#do something[/simple_tooltip]<br>
For example:
In order to do a loop, something needs be looped over, in this case it is some array of numbers.
Let’s add this code to our Google scraper:
In Line 2, we are creating an empty to store our image source contents, and calling it theData . In line 5, we begin our for-loop. Then we add our content to theData array using theData.append(source) . We can print each clean source if we wanted using print source.
Step 4 Put it into a CSV
Now that we have everything cleaned up, we are going to write it to a csv file:
Line 1 gives us the target file, which is going to be our very own “searchTerm”.csv. Line 2 targets our file with csv.writer() , and assigns it to “writer”.
We can check our images with the following:
Finally, we will do our last for-loop and close the csv file:
Make sure that “row” is within brackets or the csv file will be weird!
Congratulations you have scraped a Google page!!
Extra step: Let’s add some user input
But wait, instead of searching for “ucla” all the time, we can have the user input a search term by doing the following:
Your final code should look like the following:
Back to the workshop