Alright, so I wanted to revisit my last crawler and do a couple of new things. Here are the goals for this one.

  • Use lxml's etree HTML parser instead of bs4
  • Separate into functions / clean up code a bit
  • Cut out duplicate entries

I do not understand XPaths yet; the whole idea with this change is to learn about them. So, let's start with some raw guesses and some googling. I initially had some trouble, but thankfully Stack Overflow came to the rescue.
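
For anyone else learning XPath alongside me, here's a tiny self-contained sketch of what the two expressions used in this post actually return. The HTML snippet is made up purely for illustration:

```python
from lxml import etree

# A made-up HTML snippet just to show what XPath queries return.
snippet = """
<html><body>
  <a href="/about/">About</a>
  <a href="https://example.com/">Elsewhere</a>
  <img src="/images/logo.png">
</body></html>
"""

tree = etree.fromstring(snippet, etree.HTMLParser())

# '//a/@href' means: find every <a> element anywhere in the document
# and give me the value of its href attribute.
print(tree.xpath('//a/@href'))   # ['/about/', 'https://example.com/']

# '//img/@src' works the same way for image sources.
print(tree.xpath('//img/@src'))  # ['/images/logo.png']
```

The `//` prefix is what makes the search recursive over the whole document rather than just the root's children.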

import requests
from lxml import etree

htmlparser = etree.HTMLParser()

def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        # Using a new html parser
        tree = etree.fromstring(plain, htmlparser)
        for link in tree.xpath('//a/@href'):
            print(link)

if __name__ == '__main__':
    web(1, 'https://notawful.org')

The result:

malachite@localhost:~/python/scraper3$ python3 scraper3-test.py
https://notawful.org/
https://notawful.org/about/
https://notawful.org/content/images/2018/07/devon_taylor_resume.pdf
https://notawful.org/reading-list/
https://www.patreon.com/notawful
https://ko-fi.com/notawful
https://twitter.com/awfulyprideful
https://feedly.com/i/subscription/feed/https://notawful.org/rss/
/hotkeys-shortcuts-commands/
/author/notawful/
/data-security-while-traveling/
/author/notawful/
/installing-and-troubleshooting-netbox/

Alright, so from here what I want to do is cut out duplicates and keep only the links to pages under my domain. To cut out the duplicates, we are going to dump our list of links into a set(), which automatically removes any duplicate entries. I don't want my web crawler to wander away from my own site, so I am also going to filter out anything that doesn't start with /.

def web(page,WebUrl):
    if(page>0):
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        tree = etree.fromstring(plain, htmlparser)
        pages = set()
        na_tasks = []
        for link in tree.xpath("//a/@href"):
            if link.startswith('/'):
                na_tasks.append(link)
        pages.update(na_tasks)
        for entry in pages:
            print(WebUrl + entry)
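
One thing to note: plain string concatenation like `WebUrl + entry` only works here because the root URL has no trailing slash. If I wanted to be safer about joining, the standard library's `urllib.parse.urljoin` handles slashes and absolute links for you. A quick sketch (the URLs are just examples):

```python
from urllib.parse import urljoin

base = 'https://notawful.org'

# urljoin normalizes slashes, so the base with or without a trailing
# slash produces the same result for a root-relative path.
print(urljoin(base, '/about/'))        # https://notawful.org/about/
print(urljoin(base + '/', '/about/'))  # https://notawful.org/about/

# Absolute links are passed through untouched instead of being mangled.
print(urljoin(base, 'https://ko-fi.com/notawful'))  # https://ko-fi.com/notawful
```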

And the results:

malachite@localhost:~/python/scraper3$ python3 scraper3-test.py
https://notawful.org/password-managers/
https://notawful.org/fixing-ghost-on-digitalocean/
https://notawful.org/sr-part2/
https://notawful.org/data-security-while-traveling/
https://notawful.org/lessons-learned-october-2017/
https://notawful.org/gogs-install/
https://notawful.org/college-university-guide/
https://notawful.org/installing-and-troubleshooting-netbox/
https://notawful.org/online-presence-on-a-budget/
...

That is exactly the list I was looking for. Now, I want to structure this script a little differently so that I can reuse the same request-and-parse step again. Let's try that next. I am also going to condense a line or two along the way.

def make_request(url):
    resp = etree.fromstring(requests.get(url).text, htmlparser)
    return resp

def get_images(page,WebUrl):
    if(page>0):
        pages = set()
        na_tasks = []
        tree = make_request(WebUrl)
        for link in tree.xpath("//a/@href"):
            if link.startswith('/'):
                na_tasks.append(link)
        pages.update(na_tasks)
        for entry in pages:
            print(WebUrl + entry)

That looks cleaner, and it produces the same result as above. So now I need to start digging for images on the other pages. We're basically going to duplicate the XPath search, except we'll be looking for "//img/@src" instead of "//a/@href". To make things a little cleaner in case I want to pipe the list of images to a downloader, I am going to separate remote and local images and prepend the root URL to the local ones. Here's what the function ends up looking like:

def get_images(page,WebUrl):
    if(page>0):
        pages = set()
        na_tasks = []
        tree = make_request(WebUrl)
        for link in tree.xpath("//a/@href"):
            if link.startswith('/'):
                na_tasks.append(link)
        pages.update(na_tasks)
        image_local = []
        image_remote = []
        image_set = set()
        for entry in pages:
            req = WebUrl+entry
            resp = make_request(req)
            for img in resp.xpath('//img/@src'):
                if img.startswith('/'):
                    image_local.append(WebUrl+img)
                else:
                    image_remote.append(img)
        image_set.update(image_local)
        image_set.update(image_remote)
        for entry in image_set:
            print(entry)

And the results are exactly what I wanted to see, as well. Here are some of those:

malachite@localhost:~/python/scraper3$ python3 scraper3-test2.py
https://assets.digitalocean.com/articles/pdocs/site/control-panel/networking/domains/no-domains-yet.png
https://notawful.org/content/images/2018/07/role-services-window.png
https://notawful.org/content/images/2018/07/rootCDP.png
https://assets.digitalocean.com/articles/putty_do_keys/new_ssh_key_prompt.png
https://notawful.org/content/images/2018/07/certtemplatecomputer.png
https://notawful.org/content/images/2018/08/webcrawl1-chromeconsole.png
https://pbs.twimg.com/media/DUp9AABW0AA82cI.jpg:large
https://notawful.org/content/images/2018/07/submitarequest.png
https://notawful.org/content/images/2018/07/rootCAProperties.png
https://notawful.org/content/images/2018/07/rootCAName.png
https://notawful.org/content/images/2018/07/revoked.png
https://notawful.org/content/images/2018/07/adcsconfigurationwindow.png
https://imgs.xkcd.com/comics/security.png
...

This script took a little while to run because it had to make all the requests one by one, waiting for each result before it could move on to the next. If I were to continue iterating on this, I would make the requests asynchronously. *wink*
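
As a teaser, here's roughly the shape that could take using `concurrent.futures` from the standard library. The `fetch` function is a stub so the sketch runs without touching the network; in the real script it would be `make_request`:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for make_request(url); in the real crawler this would
    # perform the HTTP request and parse the response.
    return 'parsed:' + url

urls = ['/page-one/', '/page-two/', '/page-three/']

# Issue the requests concurrently instead of one at a time;
# map() still yields results in the original submission order.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))

print(results)  # ['parsed:/page-one/', 'parsed:/page-two/', 'parsed:/page-three/']
```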

All in all, changing over the HTML parser and cleaning up the script took about two and a half hours, and I ran into a handful of things I was not aware of when I started making this post.

Here's the final code for this post:

import requests
from lxml import etree

htmlparser = etree.HTMLParser()

def make_request(url):
    resp = etree.fromstring(requests.get(url).text, htmlparser)
    return resp

def get_images(page,WebUrl):
    if(page>0):
        pages = set()
        na_tasks = []
        tree = make_request(WebUrl)
        for link in tree.xpath("//a/@href"):
            if link.startswith('/'):
                na_tasks.append(link)
        pages.update(na_tasks)
        image_local = []
        image_remote = []
        image_set = set()
        for entry in pages:
            req = WebUrl+entry
            resp = make_request(req)
            for img in resp.xpath('//img/@src'):
                if img.startswith('/'):
                    image_local.append(WebUrl+img)
                else:
                    image_remote.append(img)
        image_set.update(image_local)
        image_set.update(image_remote)
        for entry in image_set:
            print(entry)

if __name__ == '__main__':
    get_images(1,'https://notawful.org')

Support the Author

Devon Taylor (They/Them) is a Canadian network architect, security consultant, and blogger. They have experience developing secure network and Active Directory implementations in low-budget and low-personnel environments. Their blog offers a unique and detailed perspective on security and game design, and they tweet about technology, security, games, and social issues. You can support their work via Patreon (USD), or directly via ko-fi.