Web Scraping and Data Mining for GMs, Part 1
A Brief Introduction
I'm planning to run an AD&D 2nd edition campaign, and I'd like to be able to programmatically access information about the monsters in my world. TSR released a huge number of monsters for 2nd edition in their Monstrous Compendium series. I own most of these books as PDFs and could theoretically extract all the monster information from them, but there is an easier way. These days, web scraping is an essential skill for programmers, and sometimes for Game Masters. If, like me, you're a GM with some basic programming skill, you can scrape a lot of useful data.
At one time, lomion.de was the place to go if you wanted to see AD&D 2nd edition monster information online. Unfortunately, that site no longer exists. Enter the Wayback Machine: an archive.org project that makes older websites accessible. They have an archive of lomion.de, and a very liberal scraping policy. This guide will show you how to grab that information and store it on your hard drive. Once it's there, we can use Python to parse the files and put them in a more useful form (like a SQLite database); that will be part 2. Let's get scraping.
Just a note about the legality of this: given that this is all copyrighted material of Wizards of the Coast, it was likely illegal for it to be online like this in the first place. Archive.org is on good legal standing, as this appears to be fair use: they are providing information about what the site was like when it was extant. My demonstrating how to scrape it is also informational, and is likely covered under fair use. You using this data in the privacy of your home is on murkier ground. But never use this data for any monetary gain; that is very illegal. I am not advocating copyright infringement, nor will I ever advocate copyright infringement. Many people at TSR worked very hard on the Monstrous Compendiums. You can purchase them on DriveThruRPG, and I encourage you to do so if you're going to be using this data.
What You Need
- A Python interpreter. I'm using Python 3.11, but this should work for most recent versions of Python.
- The following Python modules: Beautiful Soup 4, Requests, loguru (optional), lxml (optional)
- Jupyter Notebook (optional, but recommended)
- wget
You could use a text editor rather than Jupyter Notebook, but I much prefer the interactive nature of Jupyter for this type of work: I can test things before I commit to a strategy. wget is available for most platforms and might already be installed if you're using Linux; for everyone else, installation instructions are easy to find on the web. To avoid cluttering up my main Python environment, I used a Python virtual environment for this project, and I recommend you do the same. In most cases it's as easy as:
python3 -m venv [path/to/environment]
Once created, you have to activate it. I'm working on Windows, and my virtual environment path is .venv, so to activate it I type:
.venv\Scripts\activate.bat
(On Linux or macOS, the equivalent is source .venv/bin/activate.) Now I'm in a clean virtual environment and can install the modules I need.
pip install requests
pip install beautifulsoup4
pip install lxml
pip install loguru
pip install notebook
You can find more about virtual environments in Python here.
Why these modules
Requests is the most commonly used library for working with HTTP requests in Python; it's the de facto standard. Similarly, Beautiful Soup is the de facto standard for parsing HTML (though there are many others). I install lxml because that's the parsing engine I use with Beautiful Soup, but you don't have to use it if you prefer a different parser. Loguru is a powerful logging library for Python. I spent most of my development years using print to debug programs, but these days I use logging, and loguru is my preferred logging library. If you don't want to log, then don't install loguru.
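If you haven't used loguru before, here is a minimal sketch of the kind of logging the script below relies on (the log file name and messages are just examples of my own):
from loguru import logger

logger.add("example.log", rotation="10 MB")  # also write to a file, rotating at 10 MB
logger.debug("Starting download")
logger.success("Download complete")
logger.error("Something went wrong")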
[Image: loguru in action]
Why wget
wget is one of many ways to download web pages and web sites. I could use Requests and Beautiful Soup to do this, but wget already exists, and I see no reason to reinvent the wheel. Why download at all when I can parse the HTML right on the server? Because I make mistakes, and I'll have to try different strategies to properly parse the HTML on the site. That means I'd keep hitting the site over and over to get the information right. That's a waste of my bandwidth and a waste of archive.org's server resources. I'm going to have to download the information anyhow; I might as well get it all in one fell swoop and then parse it at my leisure later on.
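For comparison, downloading a single page with Requests alone would look something like this (the local file name _index.html is my own choice); wget adds link rewriting and page requisites on top of this for free:
import requests

url = "https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php"
response = requests.get(url)
response.raise_for_status()  # stop on HTTP errors
with open("_index.html", "w", encoding="utf-8") as f:
    f.write(response.text)  # keep a local copy to parse at leisure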
How to start
The Basic Steps
We want to:
- Get a list of the monsters
- Create the URLs to the pages we want to download
- Download the pages and store them on our local storage, including all relevant files like images, JavaScript, and CSS
I'll be the first to admit that I rarely use the subprocess module in Python. In fact, I can't remember the last time I did. But a little research led me to this site, which describes how to call wget from Python, which is what we'll need. Like I said, I don't like reinventing the wheel, so I'll just use their technique (and their nifty runcmd function). If you're unfamiliar, the subprocess module lets you work with operating system subprocesses. It can start programs, send them input via standard input, and receive their output, amongst other things. I'll be using it to call wget.
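If you've never used subprocess, a minimal sketch looks like this (wget --version is just a harmless test command; wget needs to be on your PATH):
import subprocess

# Run a command and capture whatever it prints
result = subprocess.run(["wget", "--version"], capture_output=True, text=True)
print(result.stdout.splitlines()[0])  # first line of wget's version banner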
If you're unfamiliar with wget, it can seem daunting; there are a lot of options. I used the wget dirty manual to figure out which options I wanted. In my case, those options are as follows (a full example command appears after the list):
- -nc - No clobber. If the transfer gets interrupted and I have to start over, this prevents wget from re-downloading and overwriting files that are already on disk.
- -k - Converts the links in the downloaded documents so that they work when stored locally.
- --user-agent="PythonGrabMonsterScript" - Not necessary, but this identifies the script to the server it's talking to. I just consider it polite.
- -p - Downloads page requisites: things like CSS and JavaScript files, along with images.
- --restrict-file-names=windows - Makes sure any downloaded filenames work in Windows.
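Put together, the command the script below builds for a single page looks like this (shown here with the archived index URL):
wget -nc -k --user-agent="PythonGrabMonsterScript" -p --restrict-file-names=windows https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php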
And now we have all the pieces to build the program. The program I've built is below. It is under the MIT license (except the runcmd function, which isn't mine). I am not responsible for what you do with this, but please don't use it to break the law in any way. Note that wget will recreate the original directory structure of the server to the best of its ability, so finding the files might be a pain. The main file you need to find after this is done is _index.php. On my machine, it ends up under web.archive.org\web\20180818101608if_\http%3A\lomion.de\cmm. Keep in mind that this will take a long time (it took about 12 hours on my machine) and uses about 200 MB of disk space.
import requests
import time
import sys
from loguru import logger
from bs4 import BeautifulSoup
import subprocess
def runcmd(cmd, verbose=False, *args, **kwargs):
    """Run a shell command from within Python. Credit to Scrapingbee.com
    (https://www.scrapingbee.com/blog/python-wget/)
    """
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        shell=True
    )
    std_out, std_err = process.communicate()
    if verbose:
        logger.debug(std_out.strip())
# set up logging
logger.add("DnDLog.txt", rotation="100 MB")
# grab the index
indexURL='https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/_index.php'
try:
    response = requests.get(indexURL)
    logger.success("Grabbed {}".format(indexURL))
except requests.RequestException:
    logger.error('Cannot grab url {}'.format(indexURL))
    sys.exit("Exiting on error")
if response.status_code != 200:
    logger.error('Received status code {}. Cannot continue'.format(response.status_code))
    sys.exit("Exiting on error")
# Now that we have the index, let's grab the monster list from the index using Beautiful Soup.
# This will help us build the URLs to download.
theHTML = response.text
soup = BeautifulSoup(theHTML, 'lxml')
monsterURLS = []
for link in soup.find_all('a', href=True):
    monsterURL = link['href']
    if '.php' in monsterURL and '_' not in monsterURL:
        monsterURLS.append(monsterURL)
numMonsters = len(monsterURLS)
logger.debug('Retrieved {} monster names'.format(numMonsters))
# Download the index, this time for storage purposes
wgetCommand='wget -nc -k --user-agent="PythonGrabMonsterScript" -p --restrict-file-names=windows'
indexCommand='{0} {1}'.format(wgetCommand, indexURL)
logger.debug(indexCommand)
logger.debug("Downloading index")
runcmd(indexCommand, verbose = True)
logger.debug("Index downloaded")
# and finally, let's grab all the monsters
baseURL='https://web.archive.org/web/20180818101608if_/http://lomion.de/cmm/'
logger.debug ('Starting Monsters----')
monsterURLS.sort()
monsterCount=0
for i in monsterURLS:
    monsterCount += 1
    time.sleep(1)  # let's be nice
    monsterURL = "{0}{1}".format(baseURL, i)
    monsterCommand = "{0} {1}".format(wgetCommand, monsterURL)
    logger.debug("Grabbing {0} from {1}: {2}/{3}".format(i, monsterURL, monsterCount, numMonsters))
    runcmd(monsterCommand, verbose=True)
    logger.debug("{0} done".format(i))
logger.debug("\n\nProgram finished")
Happy scraping!