Basic Python Web Crawler (Image Search)

I wanted to build a web crawler in Python to dive into pages and look for images. So of course the first thing I did was google it. After looking through several pages, I stumbled across this simple article. Well, that seems easy enough; let's see if we can't build from it.

Here's the original code, for reference:

import requests
from bs4 import BeautifulSoup
def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")
        for link in s.findAll('a', {'class':'s-access-detail-page'}):
            tet = link.get('title')
            print(tet)
            tet_2 = link.get('href')
            print(tet_2)
web(1,'http://www.amazon.in/s/ref=s9_acss_bw_cts_VodooFS_T4_w?rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031%2Cp_98%3A10440597031%2Cp_36%3A1500000-99999999&bbn=1805560031&rw_html_to_wsrp=1&pf_rd_m=A1K21FY43GMZF8&pf_rd_s=merchandised-search-3&pf_rd_r=2EKZMFFDEXJ5HE8RVV6E&pf_rd_t=101&pf_rd_p=c92c2f88-469b-4b56-936e-0e65f92eebac&pf_rd_i=1389432031')

Naturally, the first thing I did was pop open a file in Vi, paste the code in, and run it with python3. Naturally, it failed, which was the point. Specifically, the error it returned cited a failure on the second line (the BeautifulSoup import), meaning I already had the requests module. All I need is to get BeautifulSoup4, and it should work.

$ sudo apt-get install python3-pip
$ pip3 install beautifulsoup4

I changed the website to my own before I ran it this time, mostly because it's okay if I break my own website: web(1,'https://notawful.org'). And when I ran it...

malachite@localhost:~$ vi crawler_fail.py
malachite@localhost:~$ python3 crawler_fail.py
malachite@localhost:~$

Pros: No errors. Cons: No successes. Now is where I start breaking it down and figuring out how this little script works. The best place to start is probably the documentation, but I found it wholly unhelpful because I don't have the programming experience or the patience for tiny words to read through and understand it. My strategy here was to fail fast and iterate quickly. But first, I read the code and approximated what it does.

Well, the entirety of the program is a function, which is cool because I understand those. We pass in page, which was set to 1, and WebUrl, which was set to https://notawful.org.

def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")
        for link in s.findAll('a', {'class':'s-access-detail-page'}):
            tet = link.get('title')
            print(tet)
            tet_2 = link.get('href')
            print(tet_2)

Stepping through:

  1. If page is greater than zero, do the rest. Nothing iterates or changes page, so it stays at whatever we set it to when we called web().
  2. url contains the value of WebUrl, so this should be the string https://notawful.org.
  3. code contains the object returned by requests.get(url). I do not know exactly what kind of object this is, but my guess is that it did a GET request to url and stored the response in code (see the sketch after this list).
  4. plain contains the result of running text on code. I assume this converts whatever requests.get(url) returned into plain text, probably the raw HTML.
  5. s contains BeautifulSoup(plain, "html.parser"). I do know that BeautifulSoup turns HTML into objects that can be searched and operated on, and that this statement is creating a BeautifulSoup object and storing it in s. Beyond that, I don't understand it.
  6. For each link (a variable we're creating right now) in s.findAll('a', {'class':'s-access-detail-page'}), we are going to do a thing.
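
Here's a minimal sketch of how those first few objects could be poked at in a Python shell to check those guesses (the commented output is what I'd expect, not captured from a real run):

import requests
from bs4 import BeautifulSoup

# requests.get() returns a Response object
code = requests.get('https://notawful.org')
print(type(code))          # <class 'requests.models.Response'>
print(code.status_code)    # 200 if the request worked

# .text is the response body decoded as a string: the raw HTML
plain = code.text
print(plain[:60])          # first 60 characters of the HTML

# BeautifulSoup parses that string into a searchable tree of tags
s = BeautifulSoup(plain, "html.parser")
print(type(s))             # <class 'bs4.BeautifulSoup'>
print(s.title)             # the page's <title> tag, if there is one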

Alright, I can assume that steps one through five worked as intended, since they are just shifting object values around, and if any of those operations had failed I would have gotten some kind of error. The next couple of steps operate on link, which is a new object for each iteration as the loop steps through the values of s.findAll('a',{'class':'s-access-detail-page'}).

The smart thing to do at this point would have been to look at the documentation for what arguments findAll() took, and surely I am a smart person. Well, galaxy brain: I did not.

So, from link we are looking to get('title') and get('href'). These are going to be strings, because we use print() on them. link is an iterated object containing one result from findAll(), which is presumably limiting its search to elements matching both 'a' and {'class':'s-access-detail-page'}. Or perhaps it's looking for 'a' elements whose class contains s-access-detail-page. Either way, I opened the Chrome developer console to see if I could spot anything.

A picture of the chrome developer tab. Relevant contents below.

<a class="post-card-content-link" href="/conference-travel-advice/">...</a>

Well then, that clears that up. s.findAll('a',{'class':'s-access-detail-page'}) is looking through s for something similar to this, except that instead of post-card-content-link it is looking for one that contains s-access-detail-page. I know this is the element I am looking for because we printed link.get('href'), which would be the value of href. It doesn't appear that this one contains a title field, so I'll remove that part. So let's make some changes to that script.
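
As a quick sanity check of that reading, here's a tiny made-up example (the HTML string is invented, not pulled from the real page) showing how findAll() filters by tag name and attribute:

from bs4 import BeautifulSoup

# Two links; only one has the class we care about
html = '<a class="post-card-content-link" href="/conference-travel-advice/">post</a><a class="something-else" href="/ignore-me/">other</a>'

soup = BeautifulSoup(html, "html.parser")
# Match only <a> tags whose class attribute includes post-card-content-link
for link in soup.findAll('a', {'class': 'post-card-content-link'}):
    print(link.get('href'))   # /conference-travel-advice/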

Now we are running:

import requests
from bs4 import BeautifulSoup
def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")
        for link in s.findAll('a', {'class':'post-card-content-link'}):
            tet = link.get('href')
            print(tet)
web(1,'https://notawful.org')

And this is what we got back:

malachite@localhost:~$ python3 crawler_small.py
/college-university-guide/
/fixing-ghost-on-digitalocean/
/conference-travel-advice/
/gogs-install/
/online-presence-on-a-budget/
/sr-part2/
/sr-part1/
/strava/
/defcon-travel-advice/
/windows-pki/
/engineering-journals/
/why-i-hate-mathematicians/
/lessons-learned-october-2017/
/password-managers/
malachite@localhost:~$

So I hit the nail on the head with my initial instinct. We also know we are pulling exactly the right tags from the HTML document because we see /conference-travel-advice/ in that list, which was the one I used as a reference.

Something else I noticed is that those hrefs contain a leading /, which means we can concatenate the href and our WebUrl to get the full URL of a child page. This gives us the ability to run a second loop to pull things from child pages.
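
That concatenation works here because every href is relative and WebUrl has no trailing slash; a slightly more defensive option, if I wanted one, would be urllib.parse.urljoin (just an alternative, not what the script below uses):

from urllib.parse import urljoin

base = 'https://notawful.org'
href = '/conference-travel-advice/'

# urljoin handles trailing slashes and absolute hrefs cleanly,
# where naive concatenation can double up or break the path
print(base + href)          # https://notawful.org/conference-travel-advice/
print(urljoin(base, href))  # https://notawful.org/conference-travel-advice/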

To start opening up child pages, I copied the object-creation lines so we could run findAll() on each child page, and just added _low to the end of all the variable names. I made sure that I printed the href value so that I could see which page the images I pulled were from. Here's a snippet from that:

for link in s.findAll('a',{'class':'post-card-content-link'}):
    tet = link.get('href')
    print("href " + tet)
    url_low = WebUrl + tet
    code_low = requests.get(url_low)
    plain_low = code_low.text
    s_low = BeautifulSoup(plain_low, "html.parser")

Now, before I ran this I needed to know what the tags for images looked like. The only page that I know has lots of images is /windows-pki/.
Screencap of https://notawful.org/windows-pki/, highlighting an image and its associated tag (tag content below).

<img src="/content/images/2018/07/PKI-DomainMap.jpg" alt="PKI-DomainMap">

This makes it very easy. My s_low.findAll() only has to look for img tags, and then I print the value of src. So here's what the final code looks like:

import requests
from bs4 import BeautifulSoup
def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        # Fetch and parse the top-level page
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")
        # Find every post link on the front page
        for link in s.findAll('a',{'class':'post-card-content-link'}):
            tet = link.get('href')
            print("href " + tet)
            # Fetch and parse the child page behind that link
            url_low = WebUrl + tet
            code_low = requests.get(url_low)
            plain_low = code_low.text
            s_low = BeautifulSoup(plain_low, "html.parser")
            # Print the source of every image on the child page
            for img in s_low.findAll('img'):
                tet_low = img.get('src')
                print(tet_low)

web(1,'https://notawful.org')

I ran it, and it worked, but there were a lot of repeats of the same image, so I piped the results through uniq. Here's what those results looked like:

malachite@localhost:~$ python3 crawler.py | uniq
href /college-university-guide/
/content/images/2018/07/headshot.jpg
href /fixing-ghost-on-digitalocean/
https://assets.digitalocean.com/articles/pdocs/site/control-panel/networking/domains/no-domains-yet.png
https://assets.digitalocean.com/articles/putty_do_keys/new_ssh_key_prompt.png
https://assets.digitalocean.com/articles/pdocs/site/control-panel/droplets/n-choose-image.png
https://pbs.twimg.com/media/DaMLUoGXUAI21V6.jpg:large
/content/images/2018/07/headshot.jpg
href /conference-travel-advice/
/content/images/2018/07/headshot.jpg
...
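
Worth noting: uniq only collapses adjacent duplicates, so the same headshot still shows up once per page. If I wanted the script itself to skip anything it had already printed, a small set would do it (a sketch of the idea, not something in the script above):

seen = set()

def print_once(value):
    # Print a value only the first time it appears across all pages
    if value not in seen:
        seen.add(value)
        print(value)

# In the image loop, print_once(tet_low) would replace print(tet_low)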

From no crawler to a one-level image crawler in next to no time. I spent more time on iteration than I showed here. Most of that extra time went into getting Python set up, since I had never had to install pip before and I installed the wrong pip about three times. I also tried a couple of different ways to make web crawlers, but I decided this one was the easiest since I had to build all the moving pieces myself. Building all of the tiny pieces myself made me learn how they fit together, which is how I learn best.

This is what failing fast means to me in particular: make one or two changes here and there, test it, watch where it fails, and look into what those failures mean. Repeat until it works as intended. All in all, even with the issues with pip and whatnot, this took me about an hour and a half to figure out without knowing anything about web page design or having seen any of these packages before.

Support the Author

Devon Taylor (They/Them) is a Canadian network architect, security consultant, and blogger. They have experience developing secure network and active directory implementations in low-budget and low-personnel environments. Their blog offers a unique and detailed perspective on security and game design, and they tweet about technology, security, games, and social issues. You can support their work via Patreon (USD), or directly via ko-fi.